Modeling and Forecast of COVID-19 Vaccinations Population Size in China: A Non-Clinical Study

coronavirus-2, the virus responsible for the COVID-19 pandemic, has considerably affected the worldwide population. Health authorities and the medical community identify vaccines as an effective tool for managing public health. In this study, the autoregressive integrated moving average (ARIMA) model built-in Python was adopted to establish the COVID-19 vaccination forecast model. In this study, the sample data were selected from the Our World in Data website. COVID-19 vaccinations administered daily in China from December 16, 2020 to March 21, 2021 were analyzed to establish an autoregressive integrated moving average (ARIMA) model. In this study, the ARIMA model has demonstrated high goodness of fit for forecasting the COVID-19 vaccinations, which can be used in the short-term prediction of the monitored COVID-19 vaccination data sequence, providing a reference for the establishment of immune barrier and the supply of vaccines in China.

In this study, the ARIMA model has demonstrated high goodness of fit for forecasting the COVID-19 vaccinations, which can be used in the short-term prediction of the monitored COVID-19 vaccination data sequence, providing a reference for the establishment of immune barrier and the supply of vaccines in China.

Background
The COVID-19 pandemic is causing an unprecedented effect on global health and economics. When safe and efficient vaccines and treatments were unavailable, nonpharmaceutical interventions were used to reduce transmission and the burden of the pandemic. However, most of these measures involved huge economic costs.
Therefore, effective COVID-19 vaccines have been urgently needed to lower the morbidity and mortality of COVID-19 (Li et al., 2020). The COVID-19 pneumonia outbreak at the beginning of 2020 involved the first COVID-19 cases caused by the novel β-coronavirus severe acute respiratory syndrome CoV-2 (SARS-CoV-2) (Tregoning et al., 2020). The genetic information was disclosed on January 10, 2020, 54 days after the first case was reported. On March 13, 2020, 63 days after the SARS-CoV-2 virus was sequenced, the news was released that the first dose of the human vaccine was under testing. The Strategic Advisory Group of Experts on Immunization of the World Health Organization has outlined the value framework for COVID-19 vaccine allocation and priority, revealing the core principles of vaccine distribution (World Health Organization, 2020). These guidelines require further specifications and should be targeted to each country and their local conditions, including the intensity of the pandemic, target of pandemic control, supply of vaccines, and population eligible for vaccination. As the first epicenter country of COVID-19 pneumonia, China has invested heavily in its vaccination, being the main participator in the COVID-19 vaccine development for controlling the pandemic, wherein its vaccines were provided by the government, vaccine makers, and nongovernmental organizations (World Health Organization, 2020

Data source
In this study, the sample data were selected from the Our World in Data website (https://ourworldindata.org/coronavirus-testing#china), established by the Oxford University, on the basis of the decades-long data involving the human living standards of various countries. The website provides the daily updated number of  vaccinations and the complete COVID-19 dataset on its COVID-19 Explorer. The data were collected by browsing the official public information, and specifically, the data for China were sourced from the information published by the National Health Commission of the People's Republic of China. In this study, the daily vaccination uptake in China from December 16, 2020, to March 21, 2021, updated by the Our World in Data website, was used for modeling and the short-term forecast.

ARIMA model
The ARIMA model is mostly used to analyze the nonstationary, nonseasonal time series (Box, Jenkins & Reinsel, 2010), which is the most commonly used model for fitting nonstationary series. It can be further divided into the autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) models (Kantelhardt, et al, 2002). They are explained below in detail:

AR model
An AR model with the following structure is called the -order AR model and denoted as AR( ). The mathematical formula is as follows: Here, the value of the random variable at moment t is the multiple linear regression of the first terms , …, and . The error term is the current random interference , which is a zero-mean white noise sequence. When ∅ = 0, the model is called the centralized AR(ρ) model.

MA model
A MA model with the following structure is called the qth-order AR model and denoted as MA(q). The mathematical formula is as follows: Here, the value of the random variable at moment t is the multiple linear function of the first q random interferences , …, . The error term is the current random interference , which is a zero-mean white noise sequence.
Specifically, when μ = 0, the model is called as the centralized MA(q) model.

ARMA model
An ARMA model with the following structure is denoted as ARMA(p, q) and its mathematical formula is as follows: The ARMA model is used for stationary time series, which is a combination of the AR and MA models. It is supposed that is the observed time series, is the white noise sequence, and is the mixed process of autoregression and moving average. When the orders of AR and MA are p and q, respectively, the model is denoted as ARMA(p, q).

ARIMA model
If the time series is nonstationary, its difference should be considered to transform it into a stationary one, which is then followed by the forecast using the ARMA(p, q) model. These processes constitute the ARIMA(p, d, q) model wherein p represents the number of autoregressive terms, d the times of difference of the time series, and q the number of moving average terms. The structure of the ARIMA(p, d, q) model is as follows:

ARIMA modeling steps
For time series analysis, the ARIMA model forecasts through the following four steps: time series stationarity test, model recognition and order determination, model validation, and model forecast. These specific processes are explained below:

Series establishment and stationarity test
The time series ARIMA model must be established on the basis of a stationary time series as it cannot capture nonstationary ones. Therefore, the stationarity test is needed at first. The typical methods of stationarity tests include the sequence diagram, autocorrelation, and unit root tests. In the sequence diagram test, the mean and variance curves of the time sequence data are plotted to determine if the time series is stationary by observing its trend. In the autocorrelation test, the autocorrelation function (ACF) figure is used. For nonstationary data, the ACF figure will approach 0 slowly or not at all. For the unit root test, the solved augmented Dickey-Fuller (ADF) value will be used to determine whether the sequence data are significant in the given confidence interval. If so, the sequence data are stationary. The ADF test assumes that the time series has a unit root, i.e., the time series is nonstationary with periodic fluctuations. If the test result indicates that the absolute value of T is lower than the observation level of 1%, 5%, or 10% and the significance probability ρ value is larger than 0.05, the sequence is deemed as nonstationary, and the original hypothesis cannot be rejected. Similar to Eq. (10), the mean of the random series is expressed as follows: where represents the expectation of the random series at moment t.
If the time series is determined to be nonstationary, i.e., periodic or with a marked trend, the difference of the original sequence data will be determined. The following Eq. (11) gives the d difference equation: Here, indicates the stationary sequence after difference; Β is the delay operator, indicating that the times of time index difference is d for the series.

Model recognition and order determination
After d times of difference, the ACF and partial autocorrelation function (PACF) tests on the stationary time series are commonly performed to obtain the number of layer p and order q for ARIMA. By observing the truncation and trailing of the ACF and PACF tests, the model is fitted. The fitting results are compared and adjusted accordingly for the preliminary construction of one or more suitable ARIMA models.
If both the ACF and PACF tests of the time series are trailing, it can be determined that the forecast model is ARMA (p, q). The values of p and q are investigated gradually from the lower orders, and the smallest p and q are selected that can minimize the values of the Akaike information criterion (AIC) and Bayesian information criterion (BIC). Eq. (12) shows the mathematical expression of the ACF: where represents the value of the random series at moment t, and represents the mean of the random series.
Eq. (13) displays the mathematical expression of the PACF: where = , , whereas indicates the ACF of the random series.

Model validation
Before establishing the ARIMA model to forecast the number of administered

Model forecast
The main function of the time series analysis is to explain the autocorrelation of a time series with mathematical models, thereby predicting the future variation rule of the series. In this study, the daily data of the COVID-19 vaccinations administered in China from December 16, 2020 to March 21, 2021 will be selected to find the optimum forecast model and make five-day projections. By comparing the predicted value with the actual data, the forecast performance will be assessed.

Time series establishment and stationarity test
The data of the daily COVID-19 vaccinations administered in China from December 16, 2020 to March 21, 2021 were first used to create the time series plot. The original time series was nonstationary, which required the operation of the difference to be stationary. It was found that after trial, the first-and second-order differences could remove the trend of the original series, making it stationary. The time series plots after the difference are shown in Figures 2(b) and 2(c). In addition, after k=3, the autocorrelation and partial autocorrelation coefficients gradually decreased and fell into the confidence interval, indicating that the current time series turned stationary (Figure 2(c)).
[ Figure 2 about here] Plots of autocorrelation and partial autocorrelation before and after difference. (a) Plots of autocorrelation and partial autocorrelation for the original series, (b) Plots of autocorrelation and partial autocorrelation after the first-order difference, (c) Plots of autocorrelation and partial autocorrelation after the second-order difference.
The unit root (ADF) method was then adopted for the stationarity test. As shown in

Model recognition and order determination
According to the ACF and PACF plots of given in Figure 2(c), the autocorrelation plot after the second-order difference showed a significantly nonzero autocorrelation number of 2 or 3 with the trailing property, so q = 2 or 3. In addition, the partial autocorrelation plot showed a PACF of 2 or 3; therefore, p = 2 or 3. From the ADF test results, when the times of difference was 2, the series was stationary, so Therefore, when selecting the parameters for ARMA(p, q), AIC and BIC were combined, and the relatively optimum model was searched according to the minimization principle of AIC, BIC, and HQIC (Burnham & Anderson, 2004). From the lowest order, the AIC, BIC, and HQIC values of the preliminary models were calculated, and the AIC thermodynamic diagram was created so as to determine the orders.
[ Table 2 about here] Table 2. AIC, BIC, and HQIC values of the ARMA model Table 2

Model validation
After model establishment, the residual sequence needs to be checked to determine if it only contains white noise. If not, it indicates that there still exists useful information in the residual, demanding further revision of the model until the residual sequence is a white noise sequence. The validation process was as follows. First, when the size of the fitting model was determined, the significance of the model was assessed (α = 0.05). It was p = 0.14, much larger than 0.05. Therefore, the original hypothesis was acknowledged, assuming that the time series was a white noise sequence, i.e., a randomly generated series with no temporal correlation. Second, the autocorrelation plot of the residual was created, and the pure randomness of the residual was tested. The results of the residual autocorrelation test of ARIMA(3, 2, 3) are shown in Figure 4(b), where the ACF is basically in the 95% confidence interval, and the residual is white noise. Finally, the Q-Q plot of the ARIMA model was drawn (Figure 4(c)). The points in the middle are almost overlapping with the straight line, proving that the residual of the forecast model ARIMA(3, 2, 3) followed the normal distribution, and thus, the established ARIMA(3, 2, 3) was reasonable.
[ Figure 4 about here] Results of the ARIMA model validation. (a) Plot of residual sequence, (b) Autocorrelation and partial autocorrelation plots of the residual sequence, (c) Q-Q plot of the standardized residual for the ARIMA model.

Model forecast
The forecast ( ) function from the StatsModels library of Python was used for constructing the model combining the values of p, d, and q. Specifically, the ARIMA(3, 2, 3) model for forecasting COVID-19 vaccinations was established and five-day projections were made. The modeling code is as follows: data=data.astype('float64') model=ARIMA (data,order=(p,d,q) The predicted values, actual values, absolute errors, relative errors, and variation coefficients are given in Table 3. The predicted trend was observed to match with the actual trend with a good fitting. The average absolute error was −148175.2, the average relative error was 10.06%, and the average variation coefficient was 5.34%.
The forecast of the model was within an acceptable range. The formula for calculating the average relative error is as follows: The formula of the variation coefficient is as follows: = In Eq. (17), represents the standard deviation and represents the mean.
[ Table 3 about here] Table 3. Predicted results by the ARIMA(3, 2, 3) model  (Fedson, 1998). This would raise the question of whether to vaccinate with currently available vaccines or wait for one with a higher effectiveness against that particular variant Therefore, the ARIMA model was only applicable to short-term vaccination forecasts (Baden et al., 2021;Polack et al., 2021). For more accurate prediction, the inclusion of new data and adjustment of the model parameters are needed to adapt to the actual vaccination scenario.

Conclusion
In summary, the ARIMA (

Consent for publication
Not applicable

Availability of data and materials
The data that support the findings of this study are available from Our World in Data but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Our World in Data.  Plots of autocorrelation and partial autocorrelation before and after difference.
(a) Plots of autocorrelation and partial autocorrelation for the original series, (b) Plots of autocorrelation and partial autocorrelation after the first-order difference, (c) Plots of autocorrelation and partial autocorrelation after the second-order difference.  Performance of COVID-19 vaccination forecast.