Time Series Analysis on Incidence of Pulmonary Tuberculosis With Weather Factors During 2004-2017 in Guangdong Province, China

Background: Although the occurrence of some infectious diseases including TB was found to be associated with specic weather factors, few studies have incorporated weather factors into the model to predict the incidence of tuberculosis (TB). We aimed to establish an accurate forecasting model using TB data in Guangdong Province, incorporating local weather factors. Methods: Data of sixteen meteorological variables (2003-2016) and the TB incidence data (2004-2016) of Guangdong were collected. Seasonal autoregressive integrated moving average (SARIMA) model was constructed based on the data. SARIMA model with weather factors as explanatory variables (SARIMAX) was performed to t and predict TB incidence in 2017. Results: Maximum temperature, maximum daily rainfall, minimum relative humidity, mean vapor pressure, extreme wind speed, maximum atmospheric pressure, mean atmospheric pressure and illumination duration were signicantly associated with log(TB incidence). After tting the SARIMAX model, maximum pressure at lag 6 (β= -0.007, P < 0.05, 95% condence interval (CI): -0.011, -0.002, mean square error (MSE): 0.279) was negatively associated with log(TB incidence), while extreme wind speed at lag 5 (β=0.009, P < 0.05, 95% CI: 0.005, 0.013, MSE: 0.143) was positively associated. SARIMAX (1, 1, 1) (0, 1, 1) 12 with extreme wind speed at lag 5 was the best predictive model with lower Akaike information criterion (AIC) and MSE. The predicted monthly TB incidence all fall within the condence intervals using this model. Conclusions: Weather factors have different effects on TB incidence in Guangdong. Incorporating meteorological factors into the model increased the accuracy of prediction.


Background
As a chronic infectious disease caused by Mycobacterium tuberculosis, Tuberculosis (TB) is mainly transmitted by respiratory tract and threatens human health, ranking rst among the top ten death causes in the world. The World Health Organization (WHO) estimated that ranging 8.9-11.0 million people fell ill with TB around the world in 2019 [1]. To end the world's top infectious disease killer, the WHO set a goal of reducing the morbidity and mortality of TB by 90% and 95%, respectively, between 2015 and 2035. However the incidence has just been declining very slowly in the past few years. In China, the incidence and death of TB always rank rst among class A and B level infectious diseases. In recent years, due to effective prevention and control, the epidemic situation of TB in China has been declining at an annual rate of 3% between 2005 and 2017 [2]. Meanwhile, the infectivity of TB has become stronger and stronger since 2010 [3]. Since the population base of TB infection is large, China is still one of the 22 countries with high TB burden in the world, with approximately 866 000 new cases identi ed in 2018, second only to India [1].
Chemotherapy is the main treatment of TB while there is no major breakthrough in new medication for treating TB in the past 40 years. The chemotherapy of TB is time consuming, and the chemotherapy regimen is complex. The only vaccine in use is Bacillus Calmette Guerin (BCG), which was developed almost 100 years ago and is not applicable to adults. At the same time, the double infection of Mycobacterium tuberculosis and HIV, and the emergence of drug-resistant TB cause the current TB prevention and treatment in a dilemma [4]. In addition to endeavoring to care and treat after infection, targeted publicity, prevention and control measures or policies help to cut off its spread from the source, reduce the incidence of TB effectively, and further alleviate the burden of individuals and burden of subsequent treatment. Based on this, accurately predicting the trend of this epidemic is helpful to foresee the possible peaks and provide a reference for the prevention and control of TB.
Climate change has been a contributing factor in the transmission of various of infectious diseases, especially water-borne and vector-borne infectious diseases [5]. In some recent studies the incidence of some infectious diseases including TB was associated with speci c weather factors. In one study, the relationship between the spread of SARS-Cov-2 and meteorological variables was analyzed. The result showed that the average temperature was negatively correlated with the number of SARS-Cov-2 infection cases. The precipitation was positively correlated with the spread of SARS-Cov-2. Countries with higher rainfall showed an increase in disease transmission [6]. In another study, distributed lag nonlinear model (DLNM) was used to analyze TB incidence in Jinghong, a city in Yunnan Province, and found that the average temperature was negatively correlated with the incidence of TB, with lag period 2 months; the total precipitation and the lowest relative humidity were negatively correlated, with lag period 3 months and 4 months respectively [7]. In another study, geographically weighted regression (GWR) model was used to analyze the relationship between the incidence of TB in 2005-2015 in different districts of China and local meteorological factors. It was found that the average temperature was positively correlated with the incidence of TB, while the mean relative humidity and the mean wind speed were negatively correlated [8].
Previous studies have also explored various models, such as autoregressive integrated moving average (ARIMA) [2,9], X12-ARIMA [9], ARIMA-generalized regression neural network (GRNN), DLMN [7] and GWR [8] model in predicting TB [2], but most were conducted without incorporating meteorological factors. In a study performed on three cities in Jiangsu province, the ARIMAX model was found to be superior to the ARIMA and RNN models in predicting PTB when taking weather factors into consideration [10]. The result of a study on the dengue in Guangzhou showed that an ARIMA model with imported cases and minimum temperature as input variables was superior to a single ARIMA model in forecasting dengue transmission [11]. Guangdong is the province with the largest population and high TB mobility in China, in which the incidence of TB ranks near the top in 31 provinces in the country. But to our knowledge, there hasn't been any study exploring the TB incidence pattern of this province. In the current study, we added weather factors into the SARIMAX model, explored correlations between weather factors and the incidence of TB in Guangdong, and endeavored to establish an accurate model for estimating epidemic trends pertaining to TB.

Study area
Guangdong Province is located in the coast of South China (20°13′-25°31′N, 109°39′-117°19′E), with an area of 179725 square kilometer (Fig 1). As the most populous and largest economic province province in China at present, Guangdong governed 21 prefectures and had a permanent population of 115.21 million at the end of 2019. It is situated in a zone of subtropical monsoon climate with abundant rainfall. Spring is warm, autumn and winter are relatively sunny, while summer is hot and rainy.

Data collection
Data of TB In mainland China, the TB Information Management System has been established and operated by CCDC (Chinese Centre for Disease Control and Prevention) since 2004. It is mandatory to report every single TB case through this on-line system. Monthly TB data from January 2004 to December 2017 were obtained from the Chinese public health science data centre. The TB data used used in this study was TB monthly incidence, including TB cases that was sputum smear positive, culture positive, bacteria negative and without sputum examination. The TB cases were diagnosed according to the criteria used by the National Health and Family Planning Commission of the People's Republic of China.

Data of weather factors
Monthly local weather factors in 2003 to 2017 were obtained from the China Meterorological Science Data Center.
Sixteen meteorological characteristics (minimum relative humidity, minimum atmospheric pressure, maximum atmospheric pressure, mean atmospheric pressure, mean vapor pressure, maximum wind speed, extreme wind speed, mean 2-minute wind speed, mean temperature, minimum temperature, maximum temperature, mean minimum temperature, mean maximum temperature, illumination duration, 8pm-8pm rainfall, maximum daily rainfall) were included. Means of monthly values of these meteorological characteristics were calculated from 6 meteorology station (Station number: 59082, station name: Shaoguan; station number: 59287, station name: Guangzhou; station number: 59293, station name: Dongyuan; station number: 59316, station name: Shantou; station number: 59501, station name: Shanwei; station number: 59663, station name: Yangjiang). Data in the same month and the same station during other year were averaged to ll the missing data. The used data in this study are presented in Additional Files.

Construction of SARIMA Model
The well-known ARIMA model is widely used in time series analysis for describing and predicting epidemic prevalence for its accuracy and practicality [12][13][14]. The seasonal ARIMA (SARIMA) model that is developed from the ARIMA model, incorporates seasonal period and performs better in the presence of an obvious seasonal pattern. Hence SARIMA model is optimal in this study because both seasonal and non-seasonal trends were observed. A SARIMA (p, d, q)(P, D, Q) s model has 7 parameters, in which the on-seasonal parameters includes autoregressive model order (p), number of differences (d), and moving average model order (q), and seasonal parameters includes seasonal autoregressive model order (P), number of seasonal differences (D), and seasonal moving average model order (Q). Also, the parameter s (s=12 in this study) indicates the length of the periodic pattern. An SARIMA model can be described as the following formula: In which Z t representing the value of time series at time t, and ε t a white noise series. B here refers to a backward shift operator (e.g. BZ t = Z t−1 ). ϕ(B) = 1 -ϕ 1 B -… ϕ p B p and Φ(B) = 1 -Φ 1 B s -…Φ p B Ps denote the general and seasonal auto-regressive operators respectively. θ(B) = 1 -θ 1 B -θ q B q and Θ(B) = 1 -Θ1Bs − Θ Q B Qs stand for the general and seasonal moving average operators respectively.
Based on the monthly number of PTBs during 2004-2016, we constructed an SARIMA model for Guangdong province. The steps of the model construction are described below.

Page 5/16
The rst step was stabilization processing of the sequence. The data were processed by ordinary difference and logarithmic transformation when the sequence was unstable or by seasonal difference when the sequence had seasonal distribution. The second step was model identi cation. To establish optimal SARIMA models, the values of parameters q and Q was initially identi ed by referring to the plots of the autocorrelation function (ACF) while the order of p and P was determined by referring to the plots of the partial autocorrelation function (PACF) of the stabilized series. The parameter q could not exceed the lag order at which the ACF cut off and the parameter p did not exceed the lag order at which the PACF cut off. Also, the upper limits of seasonal parameter Q and P were the number lag order (where the sample ACF and PACF values exceeded the critical values respectively) divided by 12.
Third, the model parameters were estimated and veri ed. The maximum likelihood method were used to estimate the coe cients in the model and t-tests was then applied to test the signi cance of coe cients. Akaike Information Criterion (AIC) values were then used to measure the model t (smaller AIC values indicated better model t). Fourthly, model diagnostics were performed. The Ljung-Box Q test was conducted to ascertain the degree to which the residual series of the model was demonstrated to be white noise. Only P value more than 0.05 suggested that the residual series was white noise and that all its information was adequately extracted. Finally, the optimal SARIMA model was determined through the above steps.

Construction of SARIMAX Model and Prediction of TB incidence
SARIMAX extends the capability of the SARIMA model by integrating external factors, such as temperature, rainfall, and other meteorological factors, into the time series model. The SARIMAX model was built to evaluate the relationship between monthly TB incidence and meteorological factors, and then to forecast the TB incidence.
An SARIMAX model can be described as follows: In this model, Y t represents the value of time series at time t and X t is a covariant time series that, hopefully, could help explain or forecast Y t . Z t satis es Eq (1) and β i refers to the coe cient. The expressions m 1 and m 2 are the lower and upper limits of the lag respectively.
An SARIMAX model was constructed based on meteorological variables and the optimal SARIMA model built above. The steps of the model construction are described below. Firstly, since it was di cult to assess the dependence between the two processes with strongly autocorrelation, data of monthly TB incidence and monthly meteorological variables were prewhitened by the tted optimal SARIMA model in the previous section to separate the linear associations from their autocorrelation for both incidence data and meteorological data. On this basis, cross-correlation function (CCF) plots were then used to evaluate the relationship between the TB incidence and weather factors at different lag times. Weather factors which were signi cantly cross-correlated with incidence at speci c lags were possible covariant for the model. Secondly, we incorporated the covariant into the model and repeated steps three (parameters estimation and veri cation) and step four (model diagnostics) in the previous section to build the best SARIMAX model. Only the covariant which had signi cant parameter estimates and random residual series and lowered AIC value were selected. We used the 2004-2016 data to establish the best SARIMA model and SARIMAX model. We then used the 2017 data to validate the predictive effects of the SARIMAX model.
The package "TSA" in R 4.0.3 (https://www.r-project.org/) was used to construct the SARIMAX model. The signi cance level was set to be 0.05.

Descriptive analysis
The annual averaged TB incidence between 2004 and 2017 of Guangdong was 86.85/100 000. As shown in Figure 2, the peak of incidence occurred in 2009 with annual incidence of 109.70/100 000, and the lowest incidence occurred in 2016 with annual incidence of 71.82/100 000 with a decreasing TB trend during the years included in this study. The seasonal variation is obvious in Figure 3. There was a peak of incidence in the spring (March to May) and a trough in winter (November to December). More descriptive statistics about the weather factors and TB incidence were shown in Table 1.  (Figure 2). A logarithmic transformation of the time series of TB incidence was done to stabilize uctuations in the time series data. As the time series plot of the logarithm of TB incidence showed a time-dependent trend and an obvious seasonal distribution, 1-step non-seasonal and 1-step seasonal differences were applied separately. In this case, the values for d and D were 1. As shown in Figure 4, the ACF values of lag 1, 11 and 12 exceeded the critical value. The ACF value of lag 11, the neighbor of seasonal lag 12, was caused by the cross effect of the seasonal and non-seasonal autocorrelation. Therefore, the maximum values of the seasonal parameter, Q, and the non-seasonal parameter, q, were 1 and 1, respectively. Similarly, the PACF values were signi cant at lag 1, 2, 11, 12 and 37 (Figure 4), so the maximum values of the seasonal parameter, P, and the non-seasonal parameter, p, were 3 and 2, respectively. We assumed that the maximum values of P was 1 to make the model concise. We searched all 24 SARIMA models that satis ed the conditions p ≤ 2 , P ≤ 1, q ≤ 1, and Q ≤ 1 to nd the most suitable model. The parameters of only ve models were signi cant. Table 2 shows the results for these ve models: Model A (SARIMA (0, 1, 1) (0, 1, 1) 12  other models. Therefore, the SARIMA (0, 1, 1) (0, 1, 1) 12 model t the data better.

SARIMAX model analysis and prediction
The CCF was used to explore the relationship between weather factors and TB incidence. The tted SARIMA model was applied to prewhiten the data of monthly TB incidence and the monthly averaged values of the weather factors. Figure 5 shows the cross-correlation between the prewhitened weather variables and log(TB incidence) at lags of 0 to 6 months. Only positive lags would be considered because the positive value indicated that meteorological factors could affect TB incidence a certain period of time later. The weather factors, maximum temperature, maximum daily rainfall, minimum relative humidity, mean vapor pressure, extreme wind speed, maximum atmospheric pressure, mean atmospheric pressure and illumination duration, at different lags were signi cantly associated with log(TB incidence). For example, the CCF for minimum relative humidity and log(TB incidence) was signi cant at lag 1 and lag2. The CCF for maximum daily rainfall and log(TB incidence) was signi cant at lag 5 and lag 6. Those signi cant weather factors were added as covariant into the SARIMA model to establish the SARIMAX model. First, the model was tested with single lagged weather factor. As is shown in Table 3, two SARIMAX models with covariant had signi cant parameters. Afterwards these two parameters were added together as covariant to build the SARIMAX model with multiple independent variables. The result indicated that maximum atmospheric pressure at lag 6, extreme wind speed at lag 5 as well as their combination affected log(TB incidence) after tting the time series regression model, but their combination failed to have signi cant parameters. Maximum atmospheric pressure at lag 6 (β= -0.007, P < 0.05, 95% con dence interval (CI): -0.011, -0.002, mean square error (MSE): 0.279) was negatively associated with log(TB incidence), while extreme wind speed at lag 5 (β= 0.009, P < 0.05, 95% CI: 0.005, 0.013, MSE: 0.143) was positively associated. SARIMAX (1, 1, 1) (0, 1, 1) 12 extreme wind speed at lag 5 was the optimal model with the lowest AIC value and MSE (Table 4).
Based on the SARIMAX model constructed above, this study attempted to predict TB incidence from January 2017 to December 2017 in Guangdong. The estimated and predicted results are shown in Figure 6. The predicted monthly numbers of TB incidence all fall within the con dence intervals.  Discussion Seasonality uctuation is a common phenomenon in the incidence of many infectious diseases including TB. In this study, a clear seasonality in the time series of TB incidence in Guangdong was found. TB incidence peaked in the spring (March to May). This result is consistent with what was found in the study by Wang et al. that TB incidence from January 1997 to August 2019 of China predominantly peaks in spring and early summer [15]. In another study of the epidemiology of TB in Xinjiang by Wubuli et al., big peaks and trough of TB were found in March and in October respectively [16]. Despite common peak and trough observed, there were also slight variations in seasonality pattern between different studies on various regions, which might arise from meteorological pattern diversity of regions involved and study periods discrepancy involved between study. The mechanisms underlying seasonal periodicity remain poorly understood, while the oscillatory changes in infectiousness, contact patterns, pathogen survival, host susceptibility, population behaviors and meteorological factors may contribute to this phenomenon [17]. For TB, strong wind and increased outdoor activities in spring may play signi cant role in this seasonality pattern.
This study examined the association between weather variables and TB. We found that the weather factors including maximum temperature, maximum daily rainfall, minimum relative humidity, mean vapor pressure, extreme wind speed, maximum atmospheric pressure, mean atmospheric pressure and illumination duration were signi cantly associated with log(TB incidence). Additionally, extreme wind speed and maximum atmospheric pressure were tted into the model. Extreme wind speed at lag 5 was positively associated with log(TB incidence), and maximum atmospheric pressure at lag 6 was positively associated. SARIMAX (0, 1, 1) (0, 1, 1) 12 with extreme wind speed at lag 5 as covariant was the optimal model with lower AIC and highest prediction accuracy (lower MSE than model without weather factors as covariant or model with maximum atmospheric pressure as covariant), which overcame the assumption of linear dependence of variables in traditional time series model and improved the accuracy of the prediction.
Our results are similar to several previous studies on the effects of meteorological factors on TB in China. Xiao et al. [7] used DLNM to analyze the 10-year TB surveillance data in Jinghong, a city in Yunnan Province. After controlling the autocorrelation, the average temperature was negatively correlated with the incidence of TB, with lag period of 2 months; the total precipitation and the lowest relative humidity were negatively correlated, with lag period of 3 months and 4 months respectively, and there was no lag in the effect of the mean wind speed and total sunshine hours on TB incidence. Zhang et al. [8] used GWR model to analyze the incidence of TB in 2005-2015 in different districts of the country and local meteorological factors. It was found that the average temperature was positively correlated with the incidence of TB, while the mean relative humidity and the mean wind speed were negatively correlated. The different lag effects of weather variables in other studies probably resulted from the differences between study locations involved.
Different from the normal mean value of weather factors which was found to be correlated with TB incidence in other studies, the maximum value of weather factors represents the occurrence of more abnormal weather.
Abnormal weather will lead to the decline of host resistance to pathogen, which is closely related to the occurrence of infectious diseases. The possible link between TB and weather factors may be attributable to the following reasons: We found extreme wind speed at lag 5 was positively associated with log(TB incidence). The higher the wind speed, the greater the possibility of disease transmission through respiratory droplets [18]. This result is consistent with the characteristic that TB incidence was high in spring when wind speed was high in Guangdong.
We also found maximum atmospheric pressure at lag 6 was positively associated with TB incidence. Air ow usually occurs from high-pressure areas to low-pressure areas, so the correlation between TB and atmospheric pressure may be related to wind speed. However, the mechanism by which pressure affects the transmission of TB virus is poorly understood. Additional studies are warranted to further delineate the underlying mechanisms.
In present study, maximum temperature was also found to be signi cantly associated with log(TB incidence). As for temperature, it can affect the indoor and outdoor activities of TB patients and other susceptible people. For example, temperature was positively associated with the number of individuals walking on the track except extreme high temperature [19,20]. Frequent outdoor activities may increase the risk of infection with TB.
Except for the above, minimum relative humidity, mean vapor pressure and maximum daily rainfall and illumination duration of different lags were also found signi cantly associated with log(TB incidence). The survival of viruses depends partially on levels of relative humidity. Viruses with lipid envelopes will tend to survive longer at lower (20-30%) RHs [21]. Continuous exposure to dry air may reduce the production of protective mucus on the surface of respiratory tract, thus weakening its resistance to the pathogen [22]. In one study, precipitation, atmospheric pressure, and relative humidity were found to have negative effects on TB incidence by indirectly lowering the concentrations of inhalable particulate matter and sulfur dioxide. And TB incidence was found to be negatively correlated with the concentration of inhalable particulate matter, sulfur dioxide, or nitrogen dioxide [23].
Also, the large amount of ultraviolet light provided by long-term sunshine not only restricts the growth of M.
tuberculosis but also promotes the synthesis of vitamin D, which can protect people from TB to some extent [24].
The ARIMA model, also known as the Box-Jenkins model, can analyze various types of time series data and is a commonly used model in time series analysis [25][26][27]. Unlike the ARIMA model, which is a univariate time series model, the ARIMAX model can deal with multivariate time series data. It adds other variables related to the target series as input variables to improve the prediction accuracy. Previous studies have explored various models, such as ARIMA [2,9], X12-ARIMA [9], ARIMA-generalized regression neural network (GRNN), DLMN [7] and DWR [8] in predicting TB [2]. However, few models have considered seasonal variation characteristics and meteorological factors [28][29][30]. In a study performed on three cities in Jiangsu province, the ARIMAX model was found to be superior to the ARIMA and RNN models in predicting PTB when taking meteorological factors into consideration [10]. A time series study in Guangzhou, China, showed that an ARIMA model with imported cases and minimum temperature as input variables was superior to a single ARIMA model in forecasting dengue transmission [11].
Another time series study in Abidjan, Coted' Ivoire, also indicated that including rainfall as an input variable can increase the accuracy of the ARIMA model in predicting infuenza [31]. In the present study, we found the addition of meteorological factors decreased MSE of SARIMA model without covariant and improved the prediction accuracy (Table 4). Both ARIMA and ARIMAX are linear regression models. Considering that this relationship may be nonlinear and possess the lag time, distributed lag nonlinear model and long-term prediction model might be applied in the future studies .
This study is not free from limitations. One limitation was that only monthly TB incidence data were available.
Weekly or daily incidence data may decrease the accuracy of lagged time estimation. Secondly, this model was established based on TB incidence and meteorological data in one single study area Guangdong during 13 years and the prediction was conducted for only one year period. Therefore, it is only suitable to predict the overall trends in Guangdong. Then, other than local climate conditions alone, the result might be affected by other potentially confounding variables such as vaccine usage, improvement in medical care, population growth, economic development, populations, and ecological characteristics. However, these data were not available for assessment in the current study. Therefore, comparisons among different places were necessary to attain results free from confounding factors. Extension of studying term, extension of prediction term, continuous data collection, acquisition and ltering of confounding variables, as well as prediction method improvement and update are necessary to enhance the t degree of the model and verify the prediction accuracy. These hopefully can be ful lled in future studies. Lastly, the long-term effect of weather factors on TB is only calculated by mathematical methods, and its biological mechanism is not clear. It is hoped that future studies will reveal the mechanism of this delayed impact and help to properly interpret the results.

Conclusions
In conclusion, SARIMAX model was successfully applied to the time series data on TB incidence in Guangdong. To our knowledge, this is the rst time weather factors were added to the SARIMAX model for prediction of TB in Guangdong. The effectiveness of the model can be re ected in the tting and predicting accuracy. In this sense, health practitioners and other relative departments might bene t from this research and take full consideration of weather factors as well as the lag effects so as to rationally allocate health resources, intervene in possible TB epidemics, and formulate corresponding healthcare strategies. With this study, earlier public health measures can be taken to effectively prevent and control the occurrence of TB. Additionally, our results suggested that it is possible to expand the application of SARIMAX model in TB and other infectious diseases prediction in different regions in the future.