DOI: https://doi.org/10.21203/rs.3.rs-2259096/v1
Objective: Scarlet fever is an increasingly serious public health problem that has attracted widespread attention worldwide. In this study, two time series models were constructed to predict the number of scarlet fever cases in Jiangsu Province, China.
Methods: Two models, an ARIMA model and a TBATS model, were constructed from the monthly number of scarlet fever cases in Jiangsu Province, China, from 2013 to 2021 to predict the number of cases in the first half of 2022. Root mean square error (RMSE) and mean absolute percentage error (MAPE) were used to select the models and evaluate their performance.
Results: The incidence of scarlet fever in Jiangsu Province from 2013 to 2021 showed a clear trend and two seasonal peaks per year. The best ARIMA model was ARIMA(1,0,1)(2,1,1)12, with RMSE = 92.23 and MAPE = 47.48% for the fitted part and RMSE = 138.31 and MAPE = 79.11% for the prediction part. The best TBATS model was TBATS(0.278, {0,0}, -, {<12,5>}), with RMSE = 69.85 and MAPE = 27.44% for the fitted part and RMSE = 57.11 and MAPE = 39.52% for the prediction part. The errors of the TBATS model were smaller than those of the ARIMA model for both fitting and forecasting.
Conclusion: The TBATS model outperformed the most commonly used SARIMA model in predicting the number of scarlet fever cases in Jiangsu Province, China, and can serve as a flexible and useful tool in decision-making for scarlet fever prevention and control in Jiangsu Province.
Scarlet fever is an acute respiratory disease caused by Group A Streptococcus (GAS). It is highly contagious and easily contracted by children[1]. Scarlet fever occurs most often in winter and spring, and children between the ages of 5 and 15 are vulnerable to infection[2]. The incubation period is usually 2–10 days, and if left untreated, patients can spread the disease for 10–21 days[3]. Scarlet fever is difficult to recognize in its early stages; its clinical manifestations include fever, headache, nausea, and vomiting, and in severe cases it can cause sepsis and even death[4]. In the 20th century, thanks to modern medical advances and the widespread use of antibiotics, scarlet fever fell to a low epidemic level[5]. Since the beginning of the 21st century, however, scarlet fever outbreaks have resurged and incidence has been increasing worldwide, including in Hong Kong and Korea in Asia[6, 7]. This has resulted in a serious local disease burden[8].
In China, scarlet fever is a category B infectious disease. The overall incidence of scarlet fever in China has increased significantly since 2004 and shows seasonal characteristics[9]. The prevalence of scarlet fever in Jiangsu Province has been rising since 2005, with a bimodal distribution[10]. The increase in incidence may be related to the increase in consultation rates, variation in pathogens, and changes in meteorological conditions due to the increase in mobile population[11]. There is no specific vaccine for scarlet fever, and the epidemic poses a serious risk to human health and safety and is an important public health problem worldwide[12].
Scientific prediction of scarlet fever incidence trends helps identify the epidemiological characteristics of the disease promptly, which supports prevention and control efforts and has significant public health implications. Mathematical models have been applied to the prediction of various diseases. For example, Zhang et al. applied a combined ARIMA-SVM model to predict the incidence trend of hemorrhagic fever with renal syndrome in Shandong Province, China[13]; Surendran et al. used a SARIMA model to predict the incidence of dengue fever in Sri Lanka[14]; and Rguibi et al. applied LSTM and other models to predict the number of COVID-19 infections in Morocco[15]. The ARIMA model is a classical time series analysis method that explores the pattern of disease occurrence in historical incidence data in order to predict future trends. However, a single ARIMA model cannot fully capture the nonlinear characteristics of a time series[16]. Owing to the influence of various factors, the incidence trend of scarlet fever tends to be non-stationary, irregular, and nonlinear[17]. The TBATS model offers a way to compensate for these shortcomings by capturing both linear and nonlinear trends and making fuller use of the information in the series. TBATS was developed in 2011 by De Livera et al.[18]. It combines the Box-Cox transformation, ARMA errors, and exponential smoothing into a single model that accounts for complex seasonality and long-term trends. The model introduces an exponential smoothing state space modeling framework for forecasting complex time series with multiple seasonal cycles, high-frequency seasonality, non-integer seasonal periods, and dual calendar effects.
In this study, ARIMA and TBATS models were developed from the number of scarlet fever cases in Jiangsu Province from 2013 to 2021 to predict the number of cases in the first half of 2022. The fitting and prediction performance of the two models was evaluated, with the aim of providing a scientific reference for the prevention and treatment of scarlet fever in Jiangsu Province.
The monthly numbers of scarlet fever cases for 2013–2018 are available from the Public Health Sciences Data Center (https://www.phsciencedata.cn/Share/), and data from 2019 onwards come from the statutory infectious disease reports published by the Jiangsu Provincial Health Commission (http://wjw.jiangsu.gov.cn/). Because the population of Jiangsu Province is large and relatively stable, the number of scarlet fever cases was used instead of the incidence rate to reflect prevalence. A database was established in Excel 2016, and the models were developed in R using the monthly case counts from January 2013 to December 2021 to predict the number of scarlet fever cases from January to June 2022.
The ARIMA model treats the data series of the forecast object over time as a stochastic series and describes it approximately with a mathematical model. ARIMA models comprise the non-seasonal ARIMA(p, d, q) and the seasonal ARIMA(p, d, q)(P, D, Q)s[19]. The three parameters p, d, and q are the autoregressive order, the order of differencing, and the moving average order, respectively; P, D, and Q are the corresponding seasonal parameters, and s denotes the seasonal period (s = 12 in this study). The principle is to transform the non-stationary time series into a stationary one and then regress the dependent variable only on its own lagged values and on the present and lagged values of the random error term[20]. The ARIMA model can be built in the following steps. ① Establishing a stationary series: the non-stationary time series is transformed into a stationary one by differencing or data transformation. ② Model identification: the parameters p and q are initially determined from plots of the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the series, and the optimal model is chosen by minimizing the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC); the auto.arima function in R can also be used to select the best model automatically, and the model parameters are tested statistically. ③ Model testing: a white noise test (Ljung-Box test) is applied to the residual series. When the Ljung-Box test gives P > 0.05, the residual series is white noise, meaning that the useful information in the original series has been extracted sufficiently and the established model is valid.
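As an illustration of steps ① and ③, the following minimal Python sketch shows one seasonal difference at lag 12 and the Ljung-Box Q statistic computed from a residual series. It is not the code used in the study, which relied on R; the function names and the toy numbers are hypothetical, and the P value (obtained by comparing Q with a chi-square distribution) is left out to keep the sketch dependency-free.

```python
def seasonal_difference(series, period=12):
    """One seasonal difference (D = 1): y[t] - y[t - period] for t >= period."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


def ljung_box_q(residuals, max_lag=12):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k),
    where r_k is the lag-k sample autocorrelation of the residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


# Toy monthly series (made-up numbers): a 12-month pattern plus a yearly step.
toy = [10 + (t % 12) + 0.5 * (t // 12) for t in range(36)]
print(seasonal_difference(toy)[:3])  # the seasonal pattern is removed

# Toy residuals (made-up numbers) for the white noise check.
toy_residuals = [1, -1, 2, -2, 1, 0, -1, 1, 0, -2, 2, -1, 1, 0]
print(ljung_box_q(toy_residuals, max_lag=3))
```

A large Q relative to the chi-square reference distribution indicates remaining autocorrelation in the residuals, which is why a model is accepted only when the test gives P > 0.05.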
TBATS is a forecasting model based on exponential smoothing. The traditional exponential smoothing model cannot describe multiple seasonal patterns, non-integer seasonal periods, or complex time series with dual calendar effects, and the TBATS model was proposed to accommodate such series. "BATS" is an exponential smoothing state space model incorporating the Box-Cox transformation, ARMA errors, and Trend and Seasonal components; the name is an acronym of these components. When the seasonal components are represented by trigonometric Fourier series, the model is called TBATS, with the initial "T" standing for trigonometric[18]. The TBATS model handles nonlinear features through the Box-Cox transformation and captures autocorrelation in the residuals with an ARMA model[21]. The model can be written as TBATS(ω, {p, q}, φ, {<mi, ki>})[22], where ω is the Box-Cox transformation parameter, φ is the damping parameter, p and q are the orders of the ARMA error process, mi is the length of the i-th seasonal period (a single period of 12 months in this study), and ki is the number of Fourier terms used for that seasonal component. The model has numerous parameters, whose values are determined automatically in R by minimizing the AIC.
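Two of the TBATS ingredients described above can be sketched in a few lines of Python: the Box-Cox transformation with parameter ω and the trigonometric terms for one seasonal period m with k harmonics. This is only an illustration of the building blocks, not the state space recursion that TBATS actually estimates; the function names are hypothetical, and the default values m = 12 and k = 5 are simply the ones reported for the fitted model in this paper.

```python
import math


def box_cox(y, omega):
    """Box-Cox transform of a positive value y; log(y) when omega == 0."""
    if omega == 0:
        return math.log(y)
    return (y ** omega - 1.0) / omega


def inv_box_cox(z, omega):
    """Inverse Box-Cox transform, mapping back to the original scale."""
    if omega == 0:
        return math.exp(z)
    return (omega * z + 1.0) ** (1.0 / omega)


def fourier_terms(t, period=12, harmonics=5):
    """(cos, sin) pairs at frequencies 2*pi*j/period for j = 1..harmonics."""
    terms = []
    for j in range(1, harmonics + 1):
        angle = 2.0 * math.pi * j * t / period
        terms.append((math.cos(angle), math.sin(angle)))
    return terms


z = box_cox(257.0, 0.278)        # transform a monthly case count
print(z, inv_box_cox(z, 0.278))  # the inverse recovers the original value
print(fourier_terms(3)[0])       # first harmonic at month index 3
```

Because the model is fitted on the transformed scale and forecasts are transformed back, prediction intervals are generally asymmetric on the original scale.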
The database was built in Excel 2016, and the ARIMA and TBATS models were built in R 4.1.3. The ARIMA model was built using the "tseries", "fUnitRoots", and "forecast" packages in R.
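In this study, model performance was evaluated with two metrics: mean absolute percentage error (MAPE) and root mean square error (RMSE). MAPE measures the average relative deviation of the predicted values from the actual values, and RMSE measures the overall magnitude of the deviation. The smaller both values are, the higher the forecast accuracy. The calculation formulas are as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$

where $y_t$ is the actual number of cases in month $t$, $\hat{y}_t$ is the predicted number, and $n$ is the number of months evaluated.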
A total of 25,180 cases were reported in Jiangsu Province from 2013 to 2021, an average of 233 cases per month. The monthly count peaked in December 2019 at 1285 cases, and the lowest monthly count was 9 cases in March 2020. The number of scarlet fever cases in Jiangsu Province from 2013 to 2021 is shown in Fig. 1. Prevalence shows a clear seasonal pattern with a bimodal distribution each year: cases are concentrated in May and June and from December to January of the following year. Figure 2 shows the number of cases in each year from 2013 to 2021.
The decomposition of the original series is shown in Fig. 3. The trend component increases from 2013 to 2020 and decreases thereafter. The seasonal component shows strong periodicity in the incidence of scarlet fever, so seasonal differencing is necessary.
The augmented Dickey-Fuller (ADF) test on the original series gave Dickey-Fuller = -2.8326, p = 0.2313, indicating a non-stationary time series that requires differencing. After one seasonal difference at lag 12 (D = 1, s = 12), Dickey-Fuller = -7.4669, p = 0.01, and the differenced series is stationary. The non-seasonal parameters p and q were determined from the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the differenced series (Fig. 4), which suggested p = 1 and q = 1. The seasonal parameters P and Q, which generally do not exceed 2, were tried one by one from lowest to highest. Several candidate models were thus obtained, and the best model was chosen on the principle of minimum AIC, as detailed in Table 1. Among the candidate models, ARIMA(1,0,1)(2,1,1)12 had the minimum AIC, 1193.6. The Ljung-Box statistic at lag 12 gave P > 0.05, suggesting that the residual series is white noise, that information extraction is sufficient, and that the model is valid; ARIMA(1,0,1)(2,1,1)12 was therefore taken as the best model. This model was fitted and used to predict the number of cases in Jiangsu Province in the first half of 2022, as shown in Fig. 5. The degree of fit was evaluated using the “accuracy()” function, giving RMSE = 92.2323 and MAPE = 47.4887.
Alternative Models | AIC | Residual Ljung-Box statistic | P |
---|---|---|---|
ARIMA(1,0,1)(0,1,1)12 | 1187.98 | 15.757 | 0.2026 |
ARIMA(1,0,1)(1,1,0)12 | 1202.63 | 18.568 | 0.9968 |
ARIMA(1,0,1)(1,1,1)12 | 1187.89 | 16.437 | 0.172 |
ARIMA(1,0,1)(0,1,0)12 | 1217.01 | 31.095 | 0.0019 |
ARIMA(1,0,1)(0,1,2)12 | 1187.8 | 16.508 | 0.1691 |
ARIMA(1,0,1)(2,1,1)12 | 1187.7 | 16.52 | 0.1686 |
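The selection rule described in the text (keep candidates whose residuals pass the Ljung-Box test, then take the minimum AIC) can be sketched as follows. This is a hypothetical illustration rather than the authors' procedure: the function name and the 0.05 threshold argument are assumptions, and the AIC and P values are copied from Table 1 as listed.

```python
# (model, AIC, Ljung-Box P value) as listed in Table 1
candidates = [
    ("ARIMA(1,0,1)(0,1,1)12", 1187.98, 0.2026),
    ("ARIMA(1,0,1)(1,1,0)12", 1202.63, 0.9968),
    ("ARIMA(1,0,1)(1,1,1)12", 1187.89, 0.172),
    ("ARIMA(1,0,1)(0,1,0)12", 1217.01, 0.0019),
    ("ARIMA(1,0,1)(0,1,2)12", 1187.8, 0.1691),
    ("ARIMA(1,0,1)(2,1,1)12", 1187.7, 0.1686),
]


def select_best(models, alpha=0.05):
    """Drop models whose residuals fail the white noise test (P <= alpha),
    then return the remaining model with the smallest AIC."""
    valid = [m for m in models if m[2] > alpha]
    return min(valid, key=lambda m: m[1])


name, aic, p_value = select_best(candidates)
print(name, aic, p_value)
```

With the tabulated values, the rule excludes ARIMA(1,0,1)(0,1,0)12 (P = 0.0019) and selects ARIMA(1,0,1)(2,1,1)12, although the AIC differences among the four lowest candidates are very small.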
The TBATS model of De Livera et al. is implemented directly in the “forecast” package of R, so the “tbats()” function can be called to achieve automatic modeling. The fitted model had Box-Cox parameter λ = 0.278, smoothing parameter α = 1.3409, and ARMA orders p = 0 and q = 0; the disease data have a single seasonal cycle of length mi = 12, represented with ki = 5 Fourier terms, and the model AIC was 1386.24. The model can be expressed as TBATS(0.278, {0,0}, -, {<12,5>}). The fit of the model was evaluated using the “accuracy()” function, giving RMSE = 69.8523 and MAPE = 27.442. The model was then used to predict the number of scarlet fever cases from January to June 2022, and the results are shown in Fig. 6.
The prediction results and error comparison of the two models are shown in Tables 2 and 3. In the fitting phase, ARIMA(1,0,1)(2,1,1)12 had MAPE = 47.48% and RMSE = 92.23. In the prediction phase, its relative errors were larger in April, May, and June and smaller in the other months, with RMSE = 138.31 and MAPE = 79.11%, so its prediction accuracy needs further improvement. The TBATS model had MAPE = 27.44% and RMSE = 69.85 in the fitting phase and MAPE = 39.52% and RMSE = 57.11 in the prediction phase. The errors of TBATS were smaller than those of the best ARIMA model on both the training and test sets. However, the errors of both models are high. Comparison plots of the two models are shown in Fig. 7.
Month (2022) | Actual value | ARIMA forecast | ARIMA 95% CI lower | ARIMA 95% CI upper | TBATS forecast | TBATS 95% CI lower | TBATS 95% CI upper |
---|---|---|---|---|---|---|---|
1 | 257 | 213 | 10.85 | 416.12 | 177 | 92.66 | 307.15 |
2 | 81 | 59 | -232.97 | 351.68 | 48 | 8.01 | 156.33 |
3 | 149 | 157 | -160.72 | 473.79 | 85 | 11.11 | 310.40 |
4 | 84 | 254 | -71.60 | 579.10 | 149 | 18.87 | 550.99 |
5 | 180 | 413 | 85.21 | 741.37 | 237 | 30.18 | 870.67 |
6 | 181 | 352 | 22.64 | 680.65 | 201 | 16.55 | 881.88 |
Evaluation indicator | ARIMA (fitting) | TBATS (fitting) | ARIMA (forecasting) | TBATS (forecasting) |
---|---|---|---|---|
MAPE (%) | 47.48 | 27.44 | 79.11 | 39.52 |
RMSE | 92.23 | 69.85 | 138.31 | 57.11 |
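As a worked check of the forecasting errors, the following Python sketch recomputes RMSE and MAPE from the rounded monthly values in Table 2 for January to June 2022. The function names are hypothetical, and because the tabulated forecasts are rounded to whole cases, the results are only expected to approximate the values in Table 3.

```python
import math

# January-June 2022, copied from Table 2
actual = [257, 81, 149, 84, 180, 181]
arima = [213, 59, 157, 254, 413, 352]
tbats = [177, 48, 85, 149, 237, 201]


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(o - p) / o for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


for name, forecast in (("ARIMA", arima), ("TBATS", tbats)):
    print(f"{name}: RMSE = {rmse(actual, forecast):.2f}, "
          f"MAPE = {mape(actual, forecast):.2f}%")
```

With these rounded inputs the sketch gives roughly RMSE 138.4 and MAPE 79.3% for ARIMA and RMSE 56.9 and MAPE 39.2% for TBATS, close to the reported 138.31 and 79.11% and 57.11 and 39.52%; the small differences are presumably due to rounding of the tabulated forecasts.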
Scarlet fever is one of the common respiratory infectious diseases in China, and its incidence has been increasing in recent years, with similar epidemic trends nationwide and worldwide[7, 23]. The prevention and control of scarlet fever still face major challenges. An adequate understanding of the epidemiological trends of scarlet fever and prediction of its incidence are important for guiding prevention and control and the allocation of related medical resources. After direct network reporting of scarlet fever was implemented in 2004[24], the reported incidence rate in Jiangsu Province remained at a low level, began to rise steeply in 2010, and then fluctuated. Overall, however, the incidence rate increased significantly compared with historical levels, and in 2019 it reached the highest level since direct network reporting began, making epidemic prevention and control a serious challenge.
This study analyzed the trend of scarlet fever incidence in Jiangsu Province, China, from 2013 to 2021. Over this period, incidence had a clear seasonal distribution with a 12-month cycle, with one peak from April to June and another between December and January of the following year. These two peaks coincide with the local school terms and are similar to those in other Chinese provinces[23, 25–27]. On the one hand, the number of scarlet fever cases may rise gradually because March or April falls at the start of the school term in China, when there is more contact between students[27, 28]. On the other hand, it may be related to changes in the temperature adaptation of hemolytic streptococci[29]. Therefore, it is crucial to strengthen scarlet fever control and preventive measures in schools from April to June. Based on the data from 2013 to 2021, this study then established two time series models, an ARIMA model and a TBATS model, and predicted the number of scarlet fever cases in Jiangsu Province in the first half of 2022. The predicted results differ from the actual values, but the approach provides a new idea for the prediction and prevention of scarlet fever, a multi-seasonal disease.
The ARIMA model is a classical time series analysis method that is widely used in the prediction of various infectious diseases and has performed well for scarlet fever: Wu et al. constructed an ARIMA(3,1,3)(3,1,0)12 model to predict the incidence of scarlet fever in Chongqing, China, and achieved good results[30]. The ARIMA model was developed for series of time-varying but interrelated dynamic data, and the seasonal component of the time series can be extracted for diseases whose incidence is seasonal. However, ARIMA models ignore the nonlinear component of the time series, have limited ability to make long-term predictions, and cannot handle complex time series with multiple seasonality, high-frequency seasonality, non-integer seasonality, and dual calendar effects[18]. The TBATS model can handle data with multiple seasonality and introduces the Box-Cox transformation to deal with nonlinear features in the time series. The results of the time series analysis show that the TBATS model outperforms the ARIMA model on both the training and test sets, which is consistent with previous studies[31, 32]. The TBATS model also has other advantages: its results are stable, and fewer initial parameters need to be estimated, so the decomposition of the time series into components is more stable. This gives TBATS the potential to describe the long-term prevalence of scarlet fever, given the epidemiological characteristics of scarlet fever incidence[33].
The results of this study show that both the ARIMA model and the TBATS model fit better than they predict, and that the prediction errors are high, with MAPE greater than 20%, so the prediction accuracy needs to be further improved. This may be related to the characteristics of the two models themselves. Time series analysis relies only on historical data and temporal information, whereas the onset of scarlet fever, as a respiratory infectious disease, is influenced by a variety of factors, such as temperature, relative humidity, precipitation, and other meteorological factors[34–37]. This can make single-factor prediction models inaccurate. In addition, the historical span chosen for this study is long, up to nine years, and changes in meteorological factors over such a period also affect the model, reducing its long-term prediction performance. More importantly, time series analysis ignores the impact of public health policies. From March 2022, following the COVID-19 outbreak in Shanghai, the Chinese government's lockdown policy imposed severe restrictions that also affected Jiangsu Province, resulting in a significant decrease in the number of scarlet fever cases from March to June compared with previous periods and hence in overprediction by the models.
Based on the above limitations, we propose the following suggestions for future research. First, monitoring data need to be collected continuously to improve the accuracy of scarlet fever case reports[38]. Second, different models can be combined to exploit the strengths of each and achieve better prediction results. Third, an ARIMAX model can be developed on the basis of the ARIMA model by adding spatial information or other covariates (such as the meteorological factors temperature, humidity, wind speed, and rainfall) to improve prediction accuracy.
In this study, the distribution of scarlet fever cases in Jiangsu Province from 2013 to 2021 showed obvious seasonality, with two peak incidence periods per year. ARIMA and TBATS models were used to predict the incidence of scarlet fever in Jiangsu Province. The TBATS model had better fitting and predictive performance and may be more suitable for predicting the incidence trend of multi-seasonal diseases.
Ethics approval and consent to participate
As the numbers of scarlet fever cases are aggregated statistical summaries, informed consent is not required.
Consent for publication
Not applicable
Competing interests
No potential conflict of interest was reported by the authors.
Author details
Acknowledgements
Not applicable.
Authors’ contributions
GZJ and ZLJ conceptualized the paper. GZJ did the statistical analysis and drafted the manuscript. GH, LYS, and TCY conducted the research and data collection. ZLJ reviewed and edited the writing. All authors made significant intellectual contributions to multiple revisions of the draft. All authors have read and agreed to the published version of the manuscript.
Funding
The authors received no specific funding for this work.
Availability of data and materials
The monthly numbers of scarlet fever cases for 2013–2018 are available from the Public Health Sciences Data Center (https://www.phsciencedata.cn/Share/), and data from 2019 onwards are from the statutory infectious disease reports published by the Jiangsu Provincial Health Commission (http://wjw.jiangsu.gov.cn/).