Clustering of panel data
Panel data are three-dimensional dynamic data and need dimensionality reduction before clustering. To divide the cities into disease risk areas, all indicators were taken into account in the clustering; the results are shown in Figure S1 of the Supplemental Materials. Yangquan was excluded because of data problems.
According to the clustering results, the remaining 10 cities were divided into two types of risk area:
The first region: Taiyuan, Changzhi, Jincheng, Jinzhong, Lvliang;
The second region: Datong, Shuozhou, Yuncheng, Xinzhou, Linfen.
Incidence and number
From 2008 to 2017, the overall incidence of tuberculosis in Shanxi Province showed a downward trend, and within each year the incidence was high in the middle of the year and low at both ends (Figure 1. Tuberculosis incidence in Shanxi Province from 2008 to 2017). The number of reported cases was lowest in January and February, and cases were most frequent from March to June, which indicates a marked seasonal increase (Figure S2).
Risk grade
The moving percentile method was used to classify the risk grade of tuberculosis incidence in order to explore its pattern of temporal clustering. The resulting TB risk classifications for the two regions of Shanxi Province from January 2011 to December 2017 are shown in Figures S3 and S4.
Lag periods
Correctly inferring the lag effect of meteorological factors on disease incidence is key to the success of the study. Climatic conditions vary greatly from region to region. On the whole, some meteorological indicators and the monthly incidence vary regularly in the different regions; the results of the time series cross-correlation analysis between meteorological indicators and monthly incidence are shown in Additional file 1. Based on these results, the related literature and the characteristics of tuberculosis, meteorological factors with a lag of 2 months were judged the most reasonable choice for fitting the dynamic Bayesian model.
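As an illustration of the cross-correlation step, the following minimal sketch correlates a meteorological factor at month t - lag with incidence at month t and picks the lag with the strongest correlation. The two series are hypothetical toy values constructed so that incidence follows the factor by two months; they are not data from this study, and the function names are assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lagged_correlation(factor, incidence, lag):
    """Correlate the factor at month t - lag with incidence at month t."""
    if lag == 0:
        return pearson(factor, incidence)
    return pearson(factor[:-lag], incidence[lag:])


# Hypothetical toy series (NOT study data): a seasonal factor, and an
# incidence series that is a linear function of the factor two months earlier.
factor = [3, 5, 9, 14, 20, 24, 26, 24, 19, 13, 6, 2, 4, 6, 10, 15, 21, 25]
incidence = [0, 0] + [2 * v + 1 for v in factor[:-2]]

correlations = {lag: lagged_correlation(factor, incidence, lag) for lag in range(5)}
best_lag = max(correlations, key=lambda lag: abs(correlations[lag]))
print(best_lag)  # 2 for this constructed example
```

In practice the choice of lag would also weigh the literature and the natural history of the disease, as described above, rather than the correlation alone.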
Principal component regression analysis
To lay the foundation for the dynamic Bayesian network (DBN) model, principal component regression was used to select the principal components and determine the main influencing factors. Supplemental Table S1 shows that the eigenvalues of the first four principal components are 2.970, 1.972, 0.952 and 0.542, respectively, and that their cumulative contribution rate reaches 91.932%. The first four principal components were therefore extracted.
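The cumulative contribution rate can be approximately reproduced from the reported eigenvalues. The sketch below assumes that the principal component analysis was run on the correlation matrix of the seven meteorological factors, so that the total variance equals 7; this assumption is not stated in the text, and the variable names are illustrative.

```python
# Eigenvalues of the first four principal components, as reported.
eigenvalues = [2.970, 1.972, 0.952, 0.542]

# Assumption: seven standardized variables, so the eigenvalues sum to 7.
n_variables = 7

cumulative = []
running_total = 0.0
for value in eigenvalues:
    running_total += value
    cumulative.append(running_total / n_variables)

for index, share in enumerate(cumulative, start=1):
    print(f"first {index} component(s): {share:.2%}")
```

Under this assumption the four components together explain about 91.9% of the variance, which agrees with the reported 91.932% up to rounding of the eigenvalues.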
The four principal component scores were then entered into a stepwise multiple linear regression (αin = 0.05, αout = 0.15) against the monthly incidence of tuberculosis.
The model fitting results (Table 1) show that the estimated coefficients of the first, third and fourth principal components are statistically significant (P < 0.05), while the second principal component could not be included in the model. The meteorological factors related to each principal component are listed in Supplemental Table S2. Therefore, all seven preselected meteorological factors were retained for the next step of DBN modeling.
Because the DBN model can only handle categorical data, continuous influencing factors must be discretized. In this study, the meteorological factors were discretized into 5 grades.
Table 1
Model fitting table of meteorological factors related to tuberculosis incidence
| Model | Unstandardized coefficient b | Unstandardized coefficient SE | Standardized coefficient Beta | t | p |
|---|---|---|---|---|---|
| constant | 4.593 | 0.052 | - | 89.123 | 0.000 |
| z3 | 0.237 | 0.053 | 0.127 | 4.489 | 0.000 |
| z1 | -0.117 | 0.030 | -0.111 | -3.907 | 0.000 |
| z4 | 0.226 | 0.070 | 0.092 | 3.227 | 0.001 |
Structural learning
Principal component regression analysis showed that all seven meteorological factors were important risk factors for tuberculosis, so all of them were used to establish the DBN model.
First, the network structure was learned with a structure learning algorithm based on Tabu Search. The initial network was then adjusted according to the actual situation and expert opinion until the whole structure met practical requirements and predicted the results accurately. The process was implemented in Weka and GeNIe.
As shown in Fig. 2 (structure learning of the Bayesian network in the first region), the risk level of TB outbreak in the first region is a child node of city, mean air pressure, mean temperature, monthly precipitation, mean relative humidity, mean wind speed and sunshine hours, indicating that the risk level is directly related to these factors. The number of days with daily precipitation ≥ 0.1 mm is a parent node of sunshine hours and mean relative humidity and a child node of monthly precipitation, indicating that it is indirectly related to the risk level of TB outbreak through sunshine hours, mean relative humidity and monthly precipitation.
It is worth noting that the risk level node has an arrow pointing to itself, which means that the risk level of the previous month affects the probability of the node at the next time point. Mean temperature is the parent node of mean air pressure, mean relative humidity is the parent node of mean wind speed and sunshine hours, and sunshine hours is the parent node of mean temperature, which shows that these factors are not independent of each other. They also indirectly affect the risk level of TB outbreak through this complex set of relationships.
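The direct and indirect relationships described for the first region can be made concrete by encoding the arcs as a list of (parent, child) pairs. The sketch below includes only the arcs named in the text; the learned network in Fig. 2 may contain others, and the node identifiers and helper functions are illustrative assumptions.

```python
# Arcs within one time slice, as described in the text for the first region.
edges = [
    ("city", "risk"),
    ("air_pressure", "risk"),
    ("temperature", "risk"),
    ("monthly_precipitation", "risk"),
    ("relative_humidity", "risk"),
    ("wind_speed", "risk"),
    ("sunshine_hours", "risk"),
    ("monthly_precipitation", "rainy_days"),  # days with precipitation >= 0.1 mm
    ("rainy_days", "sunshine_hours"),
    ("rainy_days", "relative_humidity"),
    ("temperature", "air_pressure"),
    ("relative_humidity", "wind_speed"),
    ("relative_humidity", "sunshine_hours"),
    ("sunshine_hours", "temperature"),
]

# The self-arrow on the risk node is a temporal arc: risk(t) -> risk(t + 1).
temporal_edges = [("risk", "risk")]


def parents(node):
    """Nodes with an arc pointing directly at `node`."""
    return sorted(p for p, c in edges if c == node)


def descendants(node):
    """All nodes reachable from `node` by following arcs."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for p, c in edges:
            if p == current and c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


print(parents("risk"))
print("rainy_days" in parents("risk"))      # False: no direct arc
print("risk" in descendants("rainy_days"))  # True: reaches risk indirectly
```

The number of rainy days is not a parent of the risk node, yet the risk node is among its descendants, which is what "indirectly related" means in this structure.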
According to the structure learning of the Bayesian network in the second region (Fig. 3), the parent nodes of the risk level of TB outbreak in the second region are the same as those in the first region. The difference is that the number of days with daily precipitation ≥ 0.1 mm is a parent node of mean air pressure and mean relative humidity and a child node of monthly precipitation, indicating that it is indirectly associated with the risk level of TB outbreak through mean air pressure, mean relative humidity and monthly precipitation. The remaining nodes also have complex relationships, which are not listed here.
Parameter learning
Parameter learning was performed with the expectation maximization (EM) algorithm on the network obtained from structure learning.
After the outbreak risk classification of the incidence data, the data from January 2011 to December 2017 were retained. For each of the two regions, a total of 300 sets of data from the five cities, covering January 2011 to December 2015, were used as the training set for parameter learning, and 120 sets of data from January 2016 to December 2017 were used as the validation set to evaluate classification performance. Parameter learning shows that the distribution of the meteorological factors across grades differs between the two regions; their prior probabilities are shown in Supplemental Table S3.
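The sizes of the training and validation sets follow from a split by calendar year over city-month records. A minimal sketch for the first region, in which each record is a (city, year, month) tuple; the record layout is an assumption made for illustration.

```python
# Cities of the first region, as listed in the clustering results.
cities = ["Taiyuan", "Changzhi", "Jincheng", "Jinzhong", "Lvliang"]

# One record per city and month from January 2011 to December 2017.
records = [
    (city, year, month)
    for city in cities
    for year in range(2011, 2018)
    for month in range(1, 13)
]

# Time-based split: 2011-2015 for training, 2016-2017 for validation.
training_set = [r for r in records if r[1] <= 2015]
validation_set = [r for r in records if r[1] >= 2016]

print(len(records), len(training_set), len(validation_set))  # 420 300 120
```

Splitting by time rather than at random keeps every validation month later than every training month, which matches how an early-warning model would be used.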
In this study, the risk level of tuberculosis outbreak has 6 parent nodes in both regions. Taking Taiyuan as an example, and assuming that mean air pressure, mean temperature, monthly precipitation, mean relative humidity and mean wind speed at time t are all grade 1, the transition probabilities of the risk level node are shown in Table 2.
Table 2
Transition probability of Taiyuan risk level node
| Level of risk (t) | Level of risk (t + 1): Warning | Level of risk (t + 1): No warning |
|---|---|---|
| Warning | 0.9431818261828593 | 0.8823357944000647 |
| No warning | 0.05681817381714072 | 0.1176642055999353 |
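In a conditional probability table, the probabilities for each conditioning state should sum to 1. The short check below uses only the four values printed in Table 2 and sums them along both axes; the dictionary layout and variable names are illustrative assumptions.

```python
# Values exactly as printed in Table 2, one list per table row.
table = {
    "Warning": [0.9431818261828593, 0.8823357944000647],
    "No warning": [0.05681817381714072, 0.1176642055999353],
}

# Sum down each column (first values together, second values together).
column_sums = [a + b for a, b in zip(table["Warning"], table["No warning"])]

# Sum across each row.
row_sums = {label: sum(values) for label, values in table.items()}

print([round(s, 6) for s in column_sums])
print({label: round(s, 6) for label, s in row_sums.items()})
```

Each column sums to 1 while the rows do not, which suggests that each column of the printed table holds one conditional distribution.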
Verification and comparison of the warning effect of the model
Among the 420 sets of data for the first region, 34 were classified as "outbreak" early warning, giving a ratio of "non-early warning" to "early warning" of about 11:1. Among the 420 sets of data for the second region, 26 were classified as "outbreak" early warning, giving a ratio of about 15:1. In the first and second regions, the ratio of "outbreak" warnings in the training set to those in the validation set was 1.6:1 and 1.4:1, respectively.
Table 3 shows that the classification accuracy of all three models exceeds 75% in the first region and 80% in the second region. DBN has the highest accuracy, at 95.00% and 97.50%, respectively. The other indexes (precision, TPR, TNR, F-measure and G-measure) are also highest, or tied for highest, for DBN. The F-measure of DBN is the largest, at 0.77 and 0.84 in the two regions, indicating that DBN captures the minority class better than the other two models. Its G-measure is also the largest, at 0.86 and 0.85, indicating better overall classification performance across the minority and majority classes.
Table 3
Comparison of classification accuracy of the three models in identifying "outbreak" early warning
| Region | Index | DBN | BN | SVM |
|---|---|---|---|---|
| The first region | ACC (%) | 95.00 | 86.67 | 79.17 |
| | Precision | 0.77 | 0.36 | 0.20 |
| | TPR | 0.77 | 0.31 | 0.31 |
| | TNR | 0.97 | 0.93 | 0.58 |
| | F-measure | 0.77 | 0.33 | 0.56 |
| | G-measure | 0.86 | 0.54 | 0.42 |
| The second region | ACC (%) | 97.50 | 95.00 | 83.33 |
| | Precision | 1.00 | 1.00 | 0.36 |
| | TPR | 0.73 | 0.45 | 0.45 |
| | TNR | 1.00 | 1.00 | 0.66 |
| | F-measure | 0.84 | 0.62 | 0.62 |
| | G-measure | 0.85 | 0.67 | 0.54 |
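The F-measure and G-measure reported for DBN in Table 3 can be recomputed from precision, TPR and TNR. The sketch below assumes the usual definitions, namely the F-measure as the harmonic mean of precision and TPR and the G-measure as the geometric mean of TPR and TNR; the text does not state the formulas, so these are assumptions, as are the function names.

```python
from math import sqrt


def f_measure(precision, tpr):
    """Harmonic mean of precision and true positive rate (assumed definition)."""
    return 2 * precision * tpr / (precision + tpr)


def g_measure(tpr, tnr):
    """Geometric mean of true positive and true negative rates (assumed definition)."""
    return sqrt(tpr * tnr)


# DBN values taken from Table 3.
dbn = {
    "first region": {"precision": 0.77, "tpr": 0.77, "tnr": 0.97},
    "second region": {"precision": 1.00, "tpr": 0.73, "tnr": 1.00},
}

scores = {
    region: (
        round(f_measure(v["precision"], v["tpr"]), 2),
        round(g_measure(v["tpr"], v["tnr"]), 2),
    )
    for region, v in dbn.items()
}
print(scores)
# {'first region': (0.77, 0.86), 'second region': (0.84, 0.85)}
```

Both measures depend on TPR, so they penalize a model that reaches high accuracy on imbalanced data simply by predicting the majority class.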
ROC analysis
The DBN, BN and SVM models were each fitted to the data of the two regions; the AUC was calculated and the differences were tested statistically, as shown in Fig. 4 and Table S4.
Except for the SVM model fitted in the first region, the AUC of every model exceeds 0.6. In both regions, the AUC ranks DBN > BN > SVM. Pairwise comparison of the AUCs of the three models shows that, in the first region, the AUCs of BN and SVM each differ significantly from that of DBN, whereas BN and SVM do not differ significantly from each other. In the second region, the AUCs of all three models differ significantly from one another.
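For reference, the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison; it is not necessarily the procedure used in this study, and the labels and scores are hypothetical toy values.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example (NOT study data): 1 = warning, 0 = no warning.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking, which is why values only slightly above 0.5 indicate weak discrimination.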
In general, for panel data with spatial and temporal dimensions, the DBN warning model of TB and meteorological factors established in this study classifies and recognizes risk significantly better than the static BN and SVM models, for both the minority and the majority class.