Early Warning Evaluation of Tuberculosis and Meteorological Factors in Shanxi Province based on Dynamic Bayesian Network

Tuberculosis is a major global public health problem. However, no study of the meteorological factors related to the incidence of tuberculosis in Shanxi Province has been reported. It is therefore urgent to establish an easy-to-operate early warning system for tuberculosis. The epidemiological characteristics of tuberculosis in Shanxi Province were described, and a Dynamic Bayesian Network early warning model was established using time series cross-correlation analysis and Bayesian Networks. The incidence showed an overall downward trend from 2008 to 2017, with certain seasonal characteristics. Based on cross-correlation analysis, it is reasonable to fit the Dynamic Bayesian model with meteorological factors lagged by 2 months. A comparison of the classification and recognition performance of the Dynamic Bayesian Network, Bayesian Network and support vector machine models shows that the Dynamic Bayesian Network has the highest classification accuracy in both regions. In Shanxi Province, tuberculosis clusters in time, in space, and in space-time; the incidence peaks in spring and early summer; and seven meteorological factors are the main factors affecting the incidence of tuberculosis. The classification and recognition performance of the Dynamic Bayesian Network early warning model of tuberculosis-meteorological factors is significantly better than that of the other models, and it can better predict the future.


Background
Tuberculosis (TB), one of the deadliest infectious diseases in the world, is a major global public health problem [1][2][3]. Although China has made great achievements in tuberculosis control 4, it still has the third-largest TB burden in the world and the situation is not optimistic 5, with obvious regional differences and a serious situation in the central, western and poor regions 6,7. In recent years, with population growth and the spread of multidrug-resistant tuberculosis, tuberculosis has received more and more attention all over the world 3,8. The ambitious WHO strategy to end TB aims to reduce TB incidence and mortality by 90% and 95%, respectively, by 2035 compared with 2015 3,9.
Situated in the center of China, Shanxi Province is an economically underdeveloped area in which TB incidence is unevenly distributed across cities. At present, the study of the meteorological factors related to the incidence of tuberculosis in Shanxi Province is still at an exploratory stage. It is therefore urgent to establish an easy-to-operate early warning system for tuberculosis.
This study aims to explore the influence of meteorological factors on tuberculosis in Shanxi Province and to understand the incidence and epidemic characteristics of tuberculosis in the province over the past ten years, so as to lay a foundation for establishing a dynamic Bayesian early warning model of tuberculosis-meteorological factors and to provide a decision-making basis for tuberculosis prevention and control strategies. Some studies have pointed out that the incidence of tuberculosis has certain seasonal distribution characteristics 10,11, and this study found that tuberculosis in Shanxi Province has similar characteristics, which suggests that the occurrence of tuberculosis may be related to specific climatic conditions 12.
The Bayesian Network (BN) was proposed by Judea Pearl in the 1980s, but it is difficult for the traditional BN to reflect the influence of time on the outcome of the whole event 13. The Dynamic Bayesian Network (DBN) is an extension of the static BN in the time dimension [14][15][16]; it retains the advantages of the static BN and makes the reasoning process more continuous and cumulative. Traditional epidemiological analysis mainly studies the distribution of disease in time, space and population 6,17, while time series analysis can determine whether the occurrence of a disease has periodicity and peak periods, and can predict incidence 18. DBN has great potential for data mining applications 19, but it has seen few applications in medicine, and those are mainly in gene regulatory networks. In 2010, Lingling Ge studied a method for constructing gene regulatory networks based on a dynamic Bayesian model, and in 2014 Yi Zhou improved it.
If the trend of tuberculosis incidence can be predicted by scientific methods, effective early warning can be achieved and unnecessary losses can be reduced. Therefore, a DBN early warning model was established based on tuberculosis incidence and meteorological data from January 2008 to December 2017 in Shanxi Province to make a reasonable prediction of tuberculosis incidence.
The data used in this study are repeated measurements of the monitored objects at different time points. Data of this kind are at present mainly modeled with time series analysis methods. The DBN is one such effective model, and it is increasingly accepted because of its generality and data mining ability 19.

Clustering of Panel data
Panel data are three-dimensional dynamic data whose dimensionality needs to be reduced before clustering. When dividing the disease risk areas, all indicators should be taken into account in the clustering; the results are shown in Figure S1 in the Supplemental Materials. YangQuan was left out because of data problems.
Finally, according to the clustering result, the 10 cities can be divided into two types of risk areas, as follows. The first region: TaiYuan, Changzhi, Jincheng, Jinzhong, Lvliang. The second region: Datong, Shuozhou, Yuncheng, Xinzhou, Linfen.

Incidence and number
According to the distribution of tuberculosis incidence in Shanxi Province from 2008 to 2017, the overall incidence of tuberculosis showed a downward trend, and within each year it was high in the middle and low at both ends (Figure 1. Tuberculosis incidence in Shanxi Province from 2008 to 2017). The number of reported cases was lowest in January and February, and cases were most frequent from March to June, which indicates a significant seasonal increase (Figure S2).

Risk grade
The moving percentile method was used to classify the risk grade of tuberculosis incidence, so as to explore the regularity of temporal aggregation. The results of TB risk classification in the two regions of Shanxi Province from January 2011 to December 2017 are shown in Figures S3 and S4.
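The moving percentile idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-year baseline is inferred only from the fact that grading starts in January 2011 on data that begin in 2008, while the one-month neighbourhood, the percentile cut-offs (50, 80, 95) and the function name `moving_percentile_grades` are assumptions made here for the example.

```python
import numpy as np

def moving_percentile_grades(incidence, years_back=3, half_window=1,
                             cutoffs=(50, 80, 95)):
    """Grade each month against percentiles of a moving historical baseline.

    Returns an integer array: -1 where fewer than `years_back` years of
    history exist, otherwise 0 (lowest risk) up to len(cutoffs) (highest).
    All default parameter values are illustrative assumptions.
    """
    x = np.asarray(incidence, dtype=float)
    grades = np.full(x.shape, -1, dtype=int)
    for t in range(12 * years_back, len(x)):
        # Baseline: the same calendar month, plus or minus `half_window`
        # months, in each of the previous `years_back` years.
        idx = [t - 12 * k + d
               for k in range(1, years_back + 1)
               for d in range(-half_window, half_window + 1)]
        baseline = x[[i for i in idx if i >= 0]]
        thresholds = np.percentile(baseline, cutoffs)
        # Grade = number of percentile thresholds the current value exceeds.
        grades[t] = int(np.sum(x[t] > thresholds))
    return grades
```

Because the baseline moves with the month being graded, a value is judged against what was typical for that season rather than against a single province-wide threshold.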

Lag periods
Whether the lag effect of meteorological factors on the incidence of disease can be correctly inferred is key to the success of the study. The data show that climatic conditions vary greatly from region to region. On the whole, some meteorological indicators and the monthly incidence change regularly in the different regions; the results of time series cross-correlation analysis between meteorological indicators and monthly incidence are shown in Additional file 1. According to these results, the related literature and the characteristics of tuberculosis, it is more reasonable to use meteorological factors with a lag of 2 months to fit the dynamic Bayesian model.
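A lagged cross-correlation of the kind described above can be sketched in a few lines. This is a simplified illustration that assumes plain Pearson correlation on the raw monthly series; the study's own analysis may have included steps such as detrending or prewhitening that are not shown, and the function name `lagged_cross_correlation` is hypothetical.

```python
import numpy as np

def lagged_cross_correlation(weather, incidence, max_lag=6):
    """Correlate weather at month t-k with incidence at month t, for k = 0..max_lag."""
    w = np.asarray(weather, dtype=float)
    y = np.asarray(incidence, dtype=float)
    if len(w) != len(y):
        raise ValueError("series must have the same length")
    result = {}
    for k in range(max_lag + 1):
        # Drop the last k weather values and the first k incidence values
        # so that w[t-k] is paired with y[t].
        a = w[:len(w) - k]
        b = y[k:]
        result[k] = float(np.corrcoef(a, b)[0, 1])
    return result
```

The lag whose coefficient is largest in absolute value is a candidate for the lag period, to be weighed against the literature and the natural history of the disease as the text describes.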

Principal component regression analysis
To lay the foundation for the dynamic Bayesian model, the principal components to include were selected by regression to determine the main influencing factors. Supplemental Table S1 shows that the eigenvalues of the first four principal components are 2.970, 1.972, 0.952 and 0.542, respectively. With the fourth principal component, the cumulative contribution rate reaches 91.932%. According to the principle of principal component extraction, the first four principal components are extracted.
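As a rough check on the cumulative contribution rate, assume the analysis was run on the seven standardized meteorological factors, so that the eigenvalues sum to 7 (this is an assumption; the text does not state the total variance):

```latex
\[
\frac{2.970 + 1.972 + 0.952 + 0.542}{7} = \frac{6.436}{7} \approx 0.919
\]
```

This agrees with the reported 91.932% to within the rounding of the published eigenvalues.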
The four principal component scores were further analyzed against the monthly incidence of tuberculosis by stepwise multiple linear regression (α in = 0.05 and α out = 0.15).
The model fitting results (Table 1) show that the estimated coefficients of the first, third and fourth principal components are statistically significant (P < 0.05), while the second principal component could not be included in the model. The meteorological factors related to each principal component are listed in Supplemental Table S2. Therefore, all seven preselected meteorological factors were retained for the next step of DBN modeling.
Because the DBN model can only handle categorical data, continuous influencing factors must be discretized. In this study, the meteorological factors were discretized into 5 grades.
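The text does not say how the five grades were cut. One common option, shown here purely as an assumption, is equal-frequency binning at the 20th, 40th, 60th and 80th percentiles; the function name `discretize_equal_frequency` is hypothetical.

```python
import numpy as np

def discretize_equal_frequency(values, n_grades=5):
    """Map continuous values to grades 1..n_grades with roughly equal counts."""
    x = np.asarray(values, dtype=float)
    # Interior cut points, e.g. the 20th/40th/60th/80th percentiles for 5 grades.
    cuts = np.quantile(x, np.linspace(0, 1, n_grades + 1)[1:-1])
    # digitize returns 0..n_grades-1; shift so that grades start at 1.
    return np.digitize(x, cuts, right=True) + 1
```

Equal-width binning would be an equally plausible reading; the choice matters mainly for skewed variables such as monthly precipitation, where equal-width bins can leave the upper grades almost empty.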

Structural learning
Principal Component Regression analysis showed that all seven meteorological factors were important risk factors of tuberculosis, so all of them were used to establish the DBN model.
First, the network structure is constructed by a structure learning algorithm based on Tabu Search. The initial network is then adjusted according to the actual situation and expert opinion until the whole network structure meets practical requirements and predicts the results accurately. The process is implemented in Weka and GeNie.
As can be seen from Fig. 2 (The structure learning of the Bayes network in the first region), the risk level of TB outbreaks in the first region is a child node of city, mean air pressure, mean temperature, monthly precipitation, mean relative humidity, mean wind velocity and sunshine hours, indicating that the risk level of TB outbreak is directly related to these influencing factors. The number of days with daily precipitation ≥ 0.1 mm is the parent node of sunshine hours and mean relative humidity, and is also a child node of monthly precipitation, indicating that it is indirectly related to the risk level of TB outbreak through sunshine hours, mean relative humidity and monthly precipitation.
It is worth noting that the risk level node has an arrow pointing to itself, which means that the risk level of the previous month affects the node's probability at the next time point. Mean temperature is the parent node of mean air pressure, mean relative humidity is the parent node of mean wind speed and sunshine hours, and sunshine hours is the parent node of mean temperature, which shows that these factors are not independent of each other. They also indirectly affect the risk level of TB outbreak through this complex relationship.
According to the structure learning of the Bayes network in the second region (Fig. 3), the risk level of TB outbreak in the second region is related to the influencing factors in the same way as in the first region. The difference is that the number of days with daily precipitation ≥ 0.1 mm is the parent node of mean air pressure and mean relative humidity, and is also a child node of monthly precipitation, indicating that it is indirectly associated with the risk level of tuberculosis outbreak through mean air pressure, mean relative humidity and monthly precipitation. The remaining nodes also have complex relationships and are not listed.

Parameter learning
Parameter learning was carried out with the expectation maximization (EM) algorithm on the network obtained from Bayesian structure learning.
After the outbreak risk classification of the incidence data, the data for the period from January 2011 to December 2017 remained. A total of 300 sets of data from January 2011 to December 2015 for the five cities in each of the two regions were used as the training set for parameter learning, and 120 sets of data from January 2016 to December 2017 were used as the validation set to evaluate the classification and recognition performance. Parameter learning shows that the distribution of meteorological factors across grades differs between the two regions; their prior probabilities are shown in Supplemental Table S3.
In this study, the risk level nodes of tuberculosis outbreaks in both regions have 6 parent nodes. Taking TaiYuan as an example, and assuming that the mean air pressure, mean temperature, monthly precipitation, mean relative humidity and mean wind speed at time t are all grade 1, the transition probabilities of the risk level node are shown in Table 2.

Verification and comparison of the warning effect of the model
Among the 420 sets of data collected in the first region, 34 groups were classified as "outbreak" early warning, so the ratio of "non-early warning" to "early warning" was about 11:1. In the second region, 26 groups were classified as "outbreak" early warning, a ratio of about 15:1. In the first and second regions, the ratio of "outbreak" warnings in the training set to those in the validation set was 1.6:1 and 1.4:1, respectively.
Table 3 shows that the classification accuracy of the three models is above 75% in the first region and above 80% in the second region. DBN is the highest, at 95.00% and 97.5%, respectively. The other indexes, namely precision, TPR, TNR, F-measure and G-measure, are also highest for DBN. The F-measure of DBN is the largest, at 0.77 and 0.84 in the two regions respectively, indicating that DBN reflects the performance on the minority class better than the other two models. Its G-measure is also the largest, at 0.86 and 0.85 respectively, indicating that its overall classification performance on the minority and majority classes is better.

ROC analysis
The DBN, BN and SVM models were each fitted to the data of the two regions; the AUC was calculated and differences were tested statistically, as shown in Fig. 4 and Table S4.
Except for the SVM model fitted in the first region, the area under the ROC curve (AUC) of all models is greater than 0.6. In both regions, the AUC ranks DBN > BN > SVM. Pairwise comparison of the AUC of the three models in the two regions shows that, in the first region, the AUC of BN and of SVM each differs statistically from that of DBN, but there is no statistical difference between BN and SVM. In the second region, the AUCs of the three models were statistically different from each other.
In general, for panel data with spatial and temporal dimensions, the classification and recognition performance of the DBN warning model of TB-meteorological factors established in this study is significantly better than that of the static BN and SVM models, for both the minority and the majority categories.

Discussion
SARS, influenza and COVID-19 all showed the necessity and urgency of establishing infectious disease early warning systems 25,26. In 2008, China officially launched a nationwide automatic early warning system for infectious diseases 27. At present, although there are many methods for establishing an effective early warning system for infectious diseases, most traditional methods use historical incidence data to predict fluctuations in future incidence 28, that is, to predict the incidence at a certain time in the future. They study the temporal and spatial characteristics of infectious diseases only at the level of incidence, and do not explore in depth the relationship with other factors, such as socio-economic and meteorological factors 6. If a period of historical data is missing or contains large errors, prediction becomes difficult. Therefore the DBN, a conditional probability method for inference under uncertainty, is more suitable where deterministic prediction is lacking.
At present, there have been many studies on the epidemiological characteristics of tuberculosis and the influence of meteorological factors 29,30. Meteorological factors are among the important factors affecting the occurrence and spread of infectious diseases 31. They can affect not only immunity and health but also pathogenic microorganisms in the environment, thereby affecting the prevalence of infectious diseases [32][33][34]. Shanxi Province, located in central China, is a typical mountainous plateau whose level of economic development ranks near the bottom nationally. In addition, serious environmental pollution, dense population, and bad weather such as smog may accelerate the spread of tuberculosis. Shanxi Province had a total of 198,968 registered cases of tuberculosis from 2008 to 2017, and the average reported incidence rate was 55.67/100,000, which is at the middle level nationally. At present, the study of the meteorological factors related to the incidence of tuberculosis in Shanxi Province is still at the exploratory stage. Therefore, this study describes the epidemiological characteristics of tuberculosis in Shanxi Province, explores the relationship between the incidence of tuberculosis and meteorological factors, and establishes an early warning model of tuberculosis based on DBN.
This study found that the reported incidence of tuberculosis in Shanxi Province showed a downward trend during the decade, with an average incidence of 55.67/100,000, but the number of cases was still large, so prevention and control measures should not be relaxed. Within each year the incidence is high in the middle and low at both ends, with certain seasonal characteristics. March is the peak of the whole year, and spring and early summer show an obvious seasonal increase, which is roughly consistent with previous studies 2,6,17. The results of this study show that meteorological factors have a lagged effect on the incidence of tuberculosis. Monthly mean temperature, monthly precipitation and sunshine hours are positively correlated with the incidence of tuberculosis, while monthly mean air pressure is negatively correlated. Principal component analysis shows that seven meteorological factors, such as monthly precipitation, are the main factors affecting the incidence of tuberculosis in Shanxi Province.
BN is one of the most effective theoretical models for uncertain reasoning 19,35.

(3) This study selected monthly data from 18 meteorological stations in Shanxi Province.

Method
In this study, the clustering method for panel data was used to analyze the aggregation characteristics of pulmonary tuberculosis in its spatial distribution 7. The risk grade of tuberculosis incidence was divided by the moving percentile method, so as to explore the regularity of temporal aggregation 20,21. The meteorological data were processed by IDW interpolation to represent the actual conditions in each city. The level of tuberculosis incidence is related to many factors, and there is a certain lag in time 6,12. This paper uses time series cross-correlation analysis to determine the lag period of the influence of meteorological factors on the incidence of tuberculosis in Shanxi Province from 2008 to 2017. The main meteorological influencing factors of tuberculosis were identified by principal component analysis. After Bayesian structure and parameter learning, the DBN early warning model was established and compared with BN and Support Vector Machine (SVM) 22.
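The IDW (inverse distance weighting) step mentioned above can be illustrated with a short sketch. It assumes planar coordinates, Euclidean distance and a power of 2; these choices and the function name `idw_interpolate` are assumptions for the example (the study itself used R), and real station data given as longitude and latitude would need projecting first.

```python
import numpy as np

def idw_interpolate(station_xy, station_values, target_xy, power=2.0):
    """Estimate a value at target_xy as a distance-weighted mean of station values."""
    pts = np.asarray(station_xy, dtype=float)
    vals = np.asarray(station_values, dtype=float)
    target = np.asarray(target_xy, dtype=float)
    d = np.linalg.norm(pts - target, axis=1)
    # A target that coincides with a station takes that station's value.
    if np.any(d == 0):
        return float(vals[np.argmin(d)])
    # Closer stations receive larger weights (1 / distance**power).
    w = 1.0 / d ** power
    return float(np.sum(w * vals) / np.sum(w))
```

With this scheme a city's monthly value is dominated by its nearest stations, which is how readings from 18 stations can be turned into one representative value per city.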

Construction of DBN
The DBN extends the BN to the time dimension, forming a model that can handle time series data 16. A DBN describes the dynamic process of random variables, where X_t represents the state of the node variables at time t. We define the Dynamic Bayesian network as DBN (B_0, B_→) 22. B_0 is the Bayesian network at the initial time, with X_0 as its nodes, and B_→ is the Bayesian network fragment of the transfer network, whose nodes are x_t ∪ x_t+1; x_t represents the current state, with no parent nodes 23, and x_t+1 represents the state at the next time, with conditional probability P(x_t+1 | parent(x_t+1)). The transition probability distribution of the transfer network B_→ is defined as the product, over the nodes of x_t+1, of each node's conditional probability given its parents. For any t, the joint probability distribution of X_0, ..., X_t in a DBN is similar to that of a BN: it is the product of the initial distribution given by B_0 and the transition distributions given by B_→ at each time step. That is, a DBN model defines a probability distribution over an infinite trajectory of a dynamic stochastic process. In the process of building a DBN, first the related variables are extracted and their state values are determined, then the topological structure graph is established according to the dependence among variables, and finally the conditional probability tables are established according to the dependence among variables 22,24. Combining expert knowledge with data learning, this study takes both efficiency and practical application into account in constructing the DBN model.
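In the notation above, the transition distribution and the joint distribution are conventionally written as follows. This is the standard two-slice formulation from the DBN literature rather than a formula recovered from this paper; n (the number of node variables per time slice), the superscript i indexing those variables, Pa(·) for a node's parents, and the horizon T are symbols introduced here for illustration.

```latex
\[
P\left(X_{t+1} \mid X_t\right) = \prod_{i=1}^{n} P\left(X_{t+1}^{i} \mid \mathrm{Pa}\left(X_{t+1}^{i}\right)\right)
\]
\[
P\left(X_0, X_1, \dots, X_T\right) = P_{B_0}\left(X_0\right) \prod_{t=0}^{T-1} P_{B_{\rightarrow}}\left(X_{t+1} \mid X_t\right)
\]
```

The second equation is what allows a model specified by only two components, B_0 and B_→, to describe a trajectory of any length.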
The purpose of DBN reasoning is to calculate the influence of observed variable nodes on the probability distribution of other variables; the complexity is that both the evidence and the posterior probability distribution are indexed by time. Given an observation sequence, the network can be extended to the whole time series by copying the Bayesian network fragment for each time point, after which the reasoning algorithms of BN can be applied.
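The unrolling idea can be shown on a deliberately small case: one risk-level node whose distribution at the next month depends on its own previous state and on a single observed weather grade. This is a toy sketch, far simpler than the networks in Figs. 2 and 3; the function name, the array layout `cpt[w, r, r_next]` and any probabilities passed in are hypothetical.

```python
import numpy as np

def unroll_risk_distribution(prior, cpt, weather_grades):
    """Propagate a risk-level distribution through repeated copies of one time slice.

    prior:          shape (R,), distribution of the risk level at the initial time.
    cpt:            shape (W, R, R), cpt[w, r, r_next] =
                    P(risk at t+1 = r_next | risk at t = r, weather grade = w).
    weather_grades: observed weather grade (0..W-1) for each subsequent month.
    Returns an array of shape (len(weather_grades) + 1, R).
    """
    cpt = np.asarray(cpt, dtype=float)
    belief = np.asarray(prior, dtype=float)
    history = [belief]
    for w in weather_grades:
        # One copy of the transfer network: sum over the previous risk level.
        belief = belief @ cpt[w]
        history.append(belief)
    return np.array(history)
```

Each loop iteration corresponds to one copied time slice, so the same conditional probability table is reused at every step while the evidence changes from month to month.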

The measurement criteria of warning model classification performance
The DBN model established in this study was evaluated and compared in terms of its classification performance. The evaluation indexes included accuracy (ACC), sensitivity (TPR), specificity (TNR) and precision, as well as the comprehensive indexes F-measure and G-measure. The areas under the ROC curve (AUC) were compared using the Z test. Differences were considered statistically significant at P < 0.05.
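The indexes listed above can all be computed from a binary confusion matrix, as in the sketch below. The paper does not give its formulas, so two definitions are assumed here: F-measure as the harmonic mean of precision and TPR, and G-measure as the geometric mean of TPR and TNR, which is a common choice for imbalanced classes. The function name is hypothetical.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute evaluation indexes from confusion-matrix counts.

    "Positive" is the minority "outbreak" warning class. Assumes every
    denominator is non-zero (at least one actual and one predicted positive).
    """
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)            # sensitivity
    tnr = tn / (tn + fp)            # specificity
    precision = tp / (tp + fp)
    # Assumed definitions, see the note above.
    f_measure = 2 * precision * tpr / (precision + tpr)
    g_measure = math.sqrt(tpr * tnr)
    return {"ACC": acc, "TPR": tpr, "TNR": tnr, "precision": precision,
            "F-measure": f_measure, "G-measure": g_measure}
```

With warnings outnumbered by more than 10 to 1, accuracy alone can look high for a model that rarely predicts an outbreak, which is why the F-measure and G-measure are reported alongside it.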

Software
Data collation and statistical description were performed using Excel and IBM SPSS Version 24.0. IDW interpolation and plotting were implemented using R 3.6.

Figure 2
The structure learning of Bayes network in the first region

Table 1
Model fitting table of meteorological factors related to tuberculosis incidence

Table 3

The DBN early warning model of tuberculosis-meteorological factors established in this study has significantly better classification and recognition performance than the other models for panel data with spatial and temporal dimensions. It can accumulate the regularities and experience of how variables change over time to better predict future moments.

Conclusion
1. In this study, the monitoring data of tuberculosis in Shanxi Province from 2008 to 2017 were analyzed from an ecological perspective. The results show that the incidence over the 10 years had an overall downward trend and, within each year, was high in the middle and low at both ends, with certain seasonal characteristics. The cluster analysis of multi-index panel data divided the 10 cities into two regions so that the regions could be modeled and analyzed separately.
2. The time series cross-correlation results showed that meteorological factors had a lag effect on the incidence of tuberculosis, and that monthly mean temperature, monthly precipitation and sunshine hours were positively correlated with the incidence of pulmonary tuberculosis. The results of principal component analysis show that seven meteorological factors, such as monthly precipitation, are the main factors affecting the incidence of tuberculosis in Shanxi Province.
3. The DBN early warning model of tuberculosis-meteorological factors established in this study was compared with the BN and SVM models. For panel data with spatial and temporal characteristics, DBN is superior to the other models in classification and recognition; it can better predict the future and provides a new method for decision-making on tuberculosis prevention and control in Shanxi Province.