Monthly incidence data for hepatitis B (HB) from January 2013 to September 2022 were obtained from the Health Commission of Henan Province (https://wsjkw.henan.gov.cn/zfxxgk/yqxx/index.html). HB cases are reported within 24 hours through the statutory infectious disease reporting system. Misreported data are corrected immediately, missed HB cases are reported in a timely manner, and the reported information is checked daily to remove duplicate reports. Based on the periodic pattern of HB incidence, the series from January 2013 to September 2021 was used as the training set to assess the predictive potential of the BSTS method and its suitability for estimating the epidemiological trend of HB incidence, and the data of the last 12 months (October 2021 to September 2022) were used as the test set to verify the predictive performance of the model. In addition, forecasts of 36 steps (October 2019 to September 2022) and 24 steps (October 2020 to September 2022) were used as sensitivity analyses to further verify the effectiveness and robustness of the model.
The study protocol was approved by the institutional review board of Xinxiang Medical University (No: XYLL-2019072). The data were collected anonymously; because they are secondary and publicly available, no additional ethical approval was required.
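The hold-out scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `month_range` and `split_holdout` are hypothetical, and the series is represented only by its month labels rather than by the actual case counts.

```python
from datetime import date


def month_range(start, end):
    """Return the first day of every month from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def split_holdout(series, horizon):
    """Hold out the last `horizon` points as the test set."""
    return series[:-horizon], series[-horizon:]


# Full study period: January 2013 to September 2022.
months = month_range(date(2013, 1, 1), date(2022, 9, 1))

# Main analysis (12 steps) and the two sensitivity analyses (24 and 36 steps).
for horizon in (12, 24, 36):
    train, test = split_holdout(months, horizon)
    print(horizon, len(train), train[-1], test[0])
```

With a 12-step hold-out the training set covers 105 of the 117 months and ends in September 2021; the 24- and 36-step hold-outs start in October 2020 and October 2019, matching the periods given in the text.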
ARIMA model: The ARIMA model can be expressed as ARIMA (p, d, q) (P, D, Q)S, where p, d, and q represent the orders of the non-seasonal autoregressive, differencing, and moving average components, respectively, P, D, and Q represent the orders of the seasonal autoregressive, differencing, and moving average components, respectively, and S is the length of the seasonal period (7). The process of modeling with the ARIMA model includes: 1) Testing for data stationarity using the Augmented Dickey-Fuller (ADF) test. A p-value greater than 0.05 indicates that the sequence is non-stationary and may require differencing and/or data transformation to achieve stationarity. 2) Estimating model parameters: based on the stationary sequence, the orders p, q, P, and Q are identified using the autocorrelation function (ACF) and partial autocorrelation function (PACF). The optimal model is selected using criteria such as the corrected Akaike Information Criterion (AICC) and the Bayesian Information Criterion (BIC) (8). 3) Diagnosing the model: the ACF, PACF, and Ljung-Box Q test are used to assess whether the model residuals form a random sequence (9). This method has been described by Navid Feroze (10).
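As an illustration of the diagnostic step, the Ljung-Box Q statistic can be computed directly from the model residuals. The sketch below uses only the Python standard library and hypothetical function names; it stops at the statistic itself and does not compute the chi-square p-value that the test compares it against.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic: Q = n(n+2) * sum_k rho_k^2 / (n - k), k = 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )
```

A large Q relative to the chi-square reference distribution suggests that autocorrelation remains in the residuals, so the fitted model has not yet reduced them to a random sequence.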
The mathematical form of the ARIMA (p, d, q) model can be written as:
$${W}_{t}={\theta }_{1}{W}_{t-1}+{\theta }_{2}{W}_{t-2}+\dots +{\theta }_{p}{W}_{t-p}+{\omega }_{t}-{\lambda }_{1}{\omega }_{t-1}-{\lambda }_{2}{\omega }_{t-2}-\dots -{\lambda }_{q}{\omega }_{t-q} \qquad (1)$$
where \({\theta }_{1},\dots ,{\theta }_{p}\) are the coefficients of the autoregressive process, \({\omega }_{t}\) is the error term at time t, \({\lambda }_{1},\dots ,{\lambda }_{q}\) are the coefficients of the moving average operator, and \({W}_{t}\) is the d-order differenced time series.
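To make Eq. (1) concrete, the sketch below differences a series and then evaluates the right-hand side of the equation one step ahead, with the unknown future error set to its expected value of zero. The function names and the example coefficients are hypothetical; in practice the coefficients would come from a fitted model.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing to a series."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def arma_one_step(w, resid, theta, lam):
    """One-step-ahead prediction of the differenced series.

    w     : past values of the differenced series, most recent last
    resid : past error terms, most recent last
    theta : autoregressive coefficients theta_1..theta_p
    lam   : moving average coefficients lambda_1..lambda_q
    """
    ar_part = sum(c * w[-i] for i, c in enumerate(theta, start=1))
    ma_part = sum(c * resid[-i] for i, c in enumerate(lam, start=1))
    # The future error term has expectation zero, so it drops out.
    return ar_part - ma_part


# Hypothetical example with p = 1 and q = 1.
w = difference([10.0, 11.0, 13.0])  # -> [1.0, 2.0]
print(arma_one_step(w, resid=[0.1, 0.3], theta=[0.5], lam=[0.2]))
```

The moving average part is subtracted because Eq. (1) writes the lagged error terms with negative signs.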
BSTS model: The BSTS model is a statistical technique used for feature selection, time series forecasting, nowcasting, inferring causal impact, and other applications. The model consists of three main components: a Kalman filter for time series decomposition, in which the researcher can add different state components such as trend, seasonality, and regression; a spike-and-slab method for selecting the most important regression predictors; and Bayesian model averaging for combining results and calculating predictions. Spike and slab is a shrinkage method in which the spike allows the posterior to shrink insignificant parameters towards zero, while the continuous tail of the slab allows nonzero parameters to be identified. By imposing prior beliefs on the model, spike-and-slab variable selection reduces a large set of correlated variables to a smaller set containing the important ones. In addition to shortening the list of predictors, it can reduce multicollinearity among the predictors in the regression (11, 12). Importantly, predictions made with the BSTS method are relatively insensitive to any single assumed model specification. The forecast generated by the BSTS model is based on prior information and the likelihood function, which are combined to produce a posterior distribution (11). A Markov chain Monte Carlo (MCMC) algorithm is used to sample from the posterior distribution, and the sampled results are then averaged to obtain the final prediction (10, 13). In contrast, ARIMA models typically predict from past patterns of the disease and previous prediction residuals (14). In addition, intervention analysis using the BSTS method can estimate the causal impact of the COVID-19 pandemic on the reduction in hepatitis B cases (15).
This method can generate a counterfactual prediction from a synthetic control series, describing what would have happened had there been no such intervention measures during the COVID-19 outbreak (12). Using the BSTS model, we estimated the monthly number of cases from October 2021 to September 2022 from the monthly data for January 2013 to September 2021, taking seasonal and long-term trends into account. The statistical formula for a BSTS model is shown below:
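In the standard state-space notation, and assuming a univariate series so that \({Y}_{t}\) is a scalar and \({Z}_{t}\) is a vector of the same length as the state, the two equations are usually written as follows. This is a reconstruction from the symbol definitions in the text and from the general form of structural time series models, not a verbatim copy of the authors' formulas:

$${Y}_{t}={Z}_{t}^{T}{\alpha }_{t}+{\epsilon }_{t},\qquad {\epsilon }_{t}\sim N(0,{H}_{t}) \qquad (2)$$
$${\alpha }_{t+1}={T}_{t}{\alpha }_{t}+{R}_{t}{\eta }_{t},\qquad {\eta }_{t}\sim N(0,{Q}_{t}) \qquad (3)$$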
Eq. (2) describes how the observed data \({Y}_{t}\) (number of hepatitis B cases) relate to the underlying state and is therefore called the observation equation, where \({Y}_{t}\) is the observed value at time t, \({Z}_{t}\) is the k×1 vector of known values, \({\alpha }_{t}\) is the unobserved k×1 state vector, and \({\epsilon }_{t}\) is an independent Gaussian error term with zero mean and variance \({H}_{t}\). Eq. (3) defines how the underlying state changes over time and is known as the transition equation, where \({T}_{t}\) is the k×k transition matrix, \({R}_{t}\) is the k×m error control matrix (indicating which rows of the transition equation have a non-zero error term), and \({\eta }_{t}\) is another Gaussian random error term with mean zero and variance \({Q}_{t}\). The model parameters are estimated with the Markov chain Monte Carlo (MCMC) algorithm.
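The simplest instance of the observation and transition equations is the local level model, in which the state is a single level with \({Z}_{t}={T}_{t}={R}_{t}=1\) and constant variances. The sketch below runs the Kalman filter for that special case only. It is an illustration under those assumptions, with hypothetical names, and it does not reproduce the trend, seasonal, or regression components or the MCMC estimation used by the 'bsts' package.

```python
def local_level_filter(y, obs_var, state_var, init_mean=0.0, init_var=1e6):
    """Kalman filter for y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t.

    obs_var   : variance of the observation error (H)
    state_var : variance of the state error (Q)
    init_mean, init_var : prior on the level before the first observation
    Returns the filtered level estimate after each observation.
    """
    mean, var = init_mean, init_var
    filtered = []
    for obs in y:
        # Prediction step: the level carries over, uncertainty grows by Q.
        var_pred = var + state_var
        # Update step: weight the new observation by the Kalman gain.
        gain = var_pred / (var_pred + obs_var)
        mean = mean + gain * (obs - mean)
        var = (1.0 - gain) * var_pred
        filtered.append(mean)
    return filtered


# Hypothetical monthly counts, used only to show the call.
print(local_level_filter([120.0, 130.0, 125.0], obs_var=25.0, state_var=4.0))
```

A larger `state_var` relative to `obs_var` makes the filtered level follow the data more closely, while a smaller ratio gives a smoother level.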
Statistical analysis
The Hodrick-Prescott (HP) filter and the seasonal index method were used to analyze the long-term trend, cycles, and seasonal indices of the data. To remove short-term (monthly) fluctuations and extract the long-term trend over the study years, the HP method was used as a cyclical and seasonal decomposition method (using EViews 12 software) (16, 17). The seasonal index method can estimate seasonal indices for each season, month, or week of a time series; here, the seasonal pattern of hepatitis B was analyzed by calculating the seasonal index of each month (18). The ‘forecast’ and ‘bsts’ packages in R 4.2.0 were used to build the ARIMA and BSTS models, respectively. The CausalImpact package in R was used to estimate the impact of COVID-19 on the incidence of hepatitis B. The predictive accuracy of the models was assessed with the mean absolute deviation (MAD), mean absolute percentage error (MAPE), root mean square error (RMSE), and root mean square percentage error (RMSPE).
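The four accuracy measures can be computed as in the sketch below, which uses their common textbook definitions: MAD as the mean of the absolute errors, and MAPE and RMSPE as fractions that can be multiplied by 100 to give percentages. The function name and example values are hypothetical, and the exact formulas used in the study may differ in scaling.

```python
from math import sqrt


def accuracy_metrics(actual, predicted):
    """Return MAD, MAPE, RMSE and RMSPE for paired observations and forecasts.

    MAPE and RMSPE are returned as fractions; observed values must be non-zero.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    rel_errors = [e / a for e, a in zip(errors, actual)]
    return {
        "MAD": sum(abs(e) for e in errors) / n,
        "MAPE": sum(abs(r) for r in rel_errors) / n,
        "RMSE": sqrt(sum(e * e for e in errors) / n),
        "RMSPE": sqrt(sum(r * r for r in rel_errors) / n),
    }


# Hypothetical observed and predicted monthly case counts.
print(accuracy_metrics([100.0, 200.0], [110.0, 180.0]))
```

Lower values on all four measures indicate better predictive accuracy, which allows the ARIMA and BSTS forecasts to be compared on the same test set.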