Title: Can Search Queries Successfully Forecast Novel Coronavirus (2019-nCov) Pneumonia in China?

Background: The 2019-nCov pneumonia outbreak recently emerged in China and then spread around the world. Search queries have performed well in forecasting epidemics, but it remains an open question whether they can forecast the trend and the inflection points of 2019-nCov pneumonia. Based on the Baidu Search Index, we propose three prediction models: a composite index, a composite index with filtering (Fourier transform), and suspected NCP (Novel Coronavirus Pneumonia) cases. With the trained models, we predict the new confirmed cases of 2019-nCov over the forecast period from Feb. 3rd to 9th. To identify the next peak period, we further estimate the new confirmed cases for 10 out-of-sample days, from Feb. 10th to 19th. Results: We select 16 search queries related to NCP and calculate their correlation coefficients with the new confirmed cases. The maximum correlation coefficient of each search query is above 0.8. The composite index leads the new confirmed cases by 10 days. In the in-sample prediction, the model based on the composite index with filtering performs best, with a MAPE of 4.98% and an RMSE of 192.71. By contrast, the model based on suspected NCP cases has a MAPE of 8.82% and an RMSE of 368.51 (almost twice that of the best model). The out-of-sample prediction suggests a possible peak on Feb. 16th to 17th within the next ten days. Conclusion: With noise filtering, the predictive model forecasts the new confirmed NCP cases more accurately. However, filtering removes the sharp fluctuations of the series and therefore cannot capture the detailed rises and declines of the predicted values. In contrast, the model based on the search composite index is sensitive to peaks and valleys, although its prediction error is larger. The two models can be combined: the filtering model monitors the longer-term trend, while the composite model identifies the inflection points.

search engine [1] [2] [3], SNS (social networking services) [4] [5] and the data of disease prevention and control centers [6] and sentinel hospitals. Because influenza offers abundant data and shows strong periodicity and seasonal outbreaks, most existing studies aim at the prediction of influenza outbreaks. They fall into two categories: (1) Nowcasting. GFT (Google Flu Trends) is the most representative and widely used predictive model. It relies on a range of flu keywords in Google's search engine, and the system automatically tracks and analyzes the flu whenever users enter those keywords [7]. (2) Forecasting. This kind of study focuses on finding effective epidemic prediction models [8] for the purpose of early warning. Yang et al. [9] proposed ARGO (AutoRegression with Google search data), which combines autoregression with self-adjustment of the GFT series to improve prediction accuracy, and found that its prediction error is nearly three times lower than that of GFT. When the model was used to predict flu in eight Latin American countries, its accuracy in six of them was higher than that of a time series autoregressive model and significantly higher than that of GFT [10]. Liu, Srinivasan and Meyers [11] built an early flu monitoring model based on network data. This model predicts seasonal flu 16.4 weeks earlier than the CDC (Centers for Disease Control) and 5 weeks earlier than Pervaiz et al. [12], who used Google data. In an out-of-sample test, however, it was not effective in predicting 2009 H1N1, mainly because 2009 H1N1 arrived earlier and is more transmissible than seasonal flu.
However, the 2019-nCov now spreading in China differs from flu. It is a newly discovered airborne virus with no prior data, no obvious periodicity and limited data accumulation. How can such a disease be predicted? SARS-Cov and MERS-Cov share these characteristics. In 2004, after the outbreak of SARS-Cov in China, some scholars used gene expression programming to automatically build mathematical models of the spread and trend of SARS in China. Simulations of the epidemic in Beijing and Shanxi showed a success rate of 97% for this method [13]. During the outbreak of H5N1 flu in Egypt, a Random Forest model showed clear advantages in prediction performance and accuracy over an ARIMA model [14]. For MERS-Cov, a Bayesian Belief Network was used to build a region-based risk assessment, and a simulation of 200,000 users showed that the model predicts regional and category risk with high accuracy [15]. A study that monitored the H7N9 AIV outbreak in China from 2013 to 2017 with weekly Baidu Index and Microblog Index data found that both indices were always positively correlated with confirmed cases at a certain lead time during the outbreak [16], which supports the possibility that network data can predict similar sporadic epidemics. Recently, many studies have focused on estimating 2019-nCov with transmission models [17][18][19][20][21][22][23][24], which estimate the basic reproductive number $R_0$ to characterize the spread of the coronavirus as the epidemic progresses. All of these models rely on many assumptions, and their forecasting accuracy is limited. Search data can successfully predict periodic flu outbreaks. Can novel coronavirus pneumonia in China be predicted as well? Can the next turning point of the spread and the outbreak peak be found? Both problems urgently need to be solved.

Data Source
With the rapid transmission of 2019-nCov, NHS publishes data every day at regular intervals, including the number of accumulated confirmed cases, existing confirmed cases, new confirmed cases, existing suspected cases and new suspected cases. The number of accumulated confirmed cases is historically additive, and the number of existing confirmed cases equals the number of accumulated confirmed cases minus the numbers of cured cases and deaths. However, these two figures are relatively small, and the number of existing suspected cases is also historically additive and partly overlaps with the number of confirmed cases, so this paper uses the numbers of new confirmed cases and new suspected cases to predict 2019-nCov. NHS began to publish daily data on new confirmed cases on January 17, 2020. This paper uses the number of new confirmed cases from January 17, 2020 to February 9, 2020, 24 days in total. Daily new suspected cases were published later, so we use the number of new suspected cases from January 20, 2020 to February 9, 2020, 21 days in total. The last seven days are held out as the prediction set, and the remaining days form the training set used to fit the prediction models.
The area mainly affected by 2019-nCov is China, and confirmed cases abroad are also related to the movement of Chinese residents, so this paper uses data from Baidu, the most widely used search engine in China. Baidu Index regularly publishes the previous day's search volume for each search term. According to the latest research, the incubation period of 2019-nCov is 7-14 days. Baidu search data for the 40 days from January 1, 2020 to February 9, 2020 were chosen to ensure that the Baidu data has enough lead time relative to the epidemic data.

Query Selection and Compositing
To screen search terms, most of the literature adopts the search engine recommendation method, but its prediction error is large and the recommended search terms contain too much useless information and noise. This disease has a long incubation period, and its symptoms are relatively well defined in the clinical cases so far: fever was present in 87.9% and cough in 67.7% of the confirmed cases [25]. Starting with the most common symptoms, "fever" and "cough", this paper uses the demand map in Baidu search to select the related search terms that users demand most, based on the change in search behavior before and after each search term. In addition, as understanding of the disease grew, the common names for the disease, "the novel coronavirus pneumonia" and "Wuhan pneumonia", were added. The search volume of the remaining common search terms almost coincides with the outbreak of the epidemic, so these terms were judged to be event-driven search noise and were not used for outbreak prediction. After screening the search terms and removing unrecorded terms, this paper retains 16 search terms in total. The Pearson correlation coefficient, the method most commonly used in network data prediction, is used to determine the correlation and the lead time between the search terms and the new confirmed series. Computing the correlation coefficient between each term and the new confirmed cases at leads from 16 periods down to 0 periods shows that all search terms have a strong linear dependence with the new confirmed series, which indicates that the selected search terms have good predictive ability. Most of the search terms lead the new confirmed series by 10 days, while the new suspected series leads the new confirmed series by 4 days (correlation coefficient 0.8914). Table 1 shows the maximum correlation coefficient between each search term and the new confirmed series.
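The lead-time scan described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `pearson` and `best_lead`, the `offset` parameter, and the synthetic series are all hypothetical. The offset of 16 reflects that the search data starts on January 1 while the case data starts on January 17.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def best_lead(search, cases, offset, max_lead=16):
    """Return (lead, r) with the highest correlation between search[t - lead] and cases[t].

    `search` starts `offset` days before `cases`, so search[offset + t] and
    cases[t] fall on the same calendar day. Requires offset >= max_lead.
    """
    results = []
    for lead in range(max_lead + 1):
        start = offset - lead
        window = search[start:start + len(cases)]
        results.append((lead, pearson(window, cases)))
    return max(results, key=lambda item: item[1])


# Synthetic (hypothetical) data: 40 days of search volume, 24 days of cases.
search = [(7 * i * i + 3 * i) % 23 + 50 for i in range(40)]
cases = [2 * v + 5 for v in search[6:30]]  # cases echo the search series 10 days later
lead, r = best_lead(search, cases, offset=16)
print(lead, round(r, 4))
```

Because the synthetic cases are built as a linear function of the search series shifted by 10 days, the scan recovers a lead of 10. With real data the maximum correlation would be below 1 and the chosen lead should be checked for plausibility against the incubation period.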
The correlation coefficients between the 16 search terms and the new confirmed cases are all above 0.8, and every term leads the cases, which indicates that these search terms have some predictive ability for the future epidemic. The search terms are then combined into a composite index, where x is the search volume of each series. Recalculating the correlation coefficient after the combination shows that the highest value, 0.9248, occurs when the composite index leads the new confirmed cases by 10 days. Figure 1 shows the time series of the search composite index and the new confirmed cases.
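The formula used to combine the search terms is not reproduced in this text. As a hedged illustration only, one common way to build such an index is a weighted sum in which each term's weight is proportional to its maximum correlation with new confirmed cases. The function `composite_index`, the weighting scheme, and the sample numbers below are assumptions, not the paper's method.

```python
def composite_index(series_by_term, corr_by_term):
    """Weighted sum of per-term search volumes.

    ASSUMPTION: weights are proportional to each term's correlation with
    new confirmed cases; the paper's own weighting is not shown in the text.
    """
    total = sum(corr_by_term.values())
    weights = {term: r / total for term, r in corr_by_term.items()}
    length = len(next(iter(series_by_term.values())))
    return [
        sum(weights[term] * series_by_term[term][t] for term in series_by_term)
        for t in range(length)
    ]


# Hypothetical two-term example (two days of search volume per term).
volumes = {"fever": [100, 200], "cough": [300, 400]}
correlations = {"fever": 0.9, "cough": 0.6}
print(composite_index(volumes, correlations))
```

Normalizing the weights keeps the index on the same scale as the underlying search volumes, which makes the composite series easier to compare with any single term.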

Model
Having determined the lead-time correlation of the search composite index and of the new suspected cases with the new confirmed cases, we use the training set to fit the prediction models and the prediction set to validate them. Li et al. [26] described how noise interferes with prediction based on network search data and showed that the predictive ability of search data improves significantly once the high-frequency noise is eliminated. The Fourier transform, a common signal analysis method in physics, reveals the intrinsic relation between the time function and the spectral function of a signal. Because it uses all the temporal information of the signal, it can separate out the high-frequency components so that the high-frequency noise can be removed. Figure 2 shows the filtering result. The search composite index after filtering is smoother than the original series.
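The filtering step can be illustrated with a simple Fourier low-pass filter: transform the series, zero the high-frequency bins, and transform back. The sketch below uses a plain discrete Fourier transform from the standard library so that it is self-contained. The cutoff parameter `keep` and the synthetic signal are assumptions, since the text does not state which frequencies were removed.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform (direct O(n^2) form, fine for short series)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def low_pass(x, keep):
    """Zero every frequency bin above `keep` cycles per window (hypothetical cutoff)."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        freq = min(k, n - k)  # bins k and n - k carry the same absolute frequency
        if freq > keep:
            spectrum[k] = 0
    return [value.real for value in idft(spectrum)]


# Synthetic example: a slow trend plus a fast oscillation acting as noise.
n = 32
trend = [10 + 3 * math.sin(2 * math.pi * t / n) for t in range(n)]
noisy = [s + math.cos(2 * math.pi * 8 * t / n) for t, s in enumerate(trend)]
smooth = low_pass(noisy, keep=2)
print(max(abs(a - b) for a, b in zip(smooth, trend)))
```

The choice of cutoff trades smoothness against responsiveness: a lower `keep` gives a flatter series that tracks the trend, at the cost of hiding short-lived peaks.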
Based on the above, this paper establishes three 2019-nCov prediction models: a model based on the search composite index (Index), a model based on the search composite index after filtering and noise reduction (Index_filtering), and a model based on new suspected cases (NCPS). Suspected cases include people who are actually infected with the virus but cannot yet be confirmed with current medical means, so the series is strongly correlated with, and helps explain, the number of future confirmed cases. First, the ADF test was applied to each series. The results show that none of the original series is stationary, but the first-difference series have no unit root at the 5% significance level. Because the first-difference series are stationary, the Granger non-causality test and the ARMAX model can be applied. Table 2 shows the results of fitting the models on the training set. The fitting results show that the model based on filtering and noise reduction fits best among the three prediction models, followed by the model based on the search composite index. The major variables are all significant at the 1% level, which indicates that Index, Index_filtering and NCPS all have some explanatory and predictive ability for future new confirmed cases.
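To make the modelling steps concrete, the sketch below first-differences both series and regresses the differenced cases on the differenced index at a fixed lead. This is a deliberately simplified stand-in, not the ARMAX model used in the paper: it has no autoregressive or moving-average terms and performs no ADF test (in practice a statistics library, for example the `adfuller` function in statsmodels, would be used for that). The function names and the synthetic data are hypothetical.

```python
def first_difference(x):
    """Return the series of day-over-day changes."""
    return [b - a for a, b in zip(x, x[1:])]


def fit_lagged_ols(index, cases, lead):
    """Least-squares fit of d_cases[t] = intercept + slope * d_index[t - lead].

    `index` and `cases` must be aligned day by day and have equal length.
    """
    d_index = first_difference(index)
    d_cases = first_difference(cases)
    xs = d_index[:len(d_index) - lead]  # index changes, `lead` days earlier
    ys = d_cases[lead:]                 # case changes they are meant to explain
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic example: cases follow three times the index with a 2-day delay.
index = [(i * i) % 11 for i in range(20)]
cases = [0, 0] + [3 * v + 7 for v in index[:-2]]
print(fit_lagged_ols(index, cases, lead=2))
```

Fitting on differences rather than levels mirrors the stationarity argument in the text: a regression on non-stationary levels can show a strong fit that reflects nothing more than a shared trend.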

Results
Using the trained models, this paper predicts the new confirmed cases of 2019-nCov from February 3 to 9, 2020 in the prediction set and evaluates the prediction performance of the three models. Among the many available evaluation indexes, this paper uses two common prediction error measures: MAPE (Mean Absolute Percentage Error) and RMSE (Root-Mean-Square Error). The predicted values of 2019-nCov in China for the seven days of the prediction set are shown in Table 3.
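For reference, the two error measures follow their standard definitions and can be computed as below. The sample numbers are hypothetical and are not taken from Table 3.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be nonzero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-Mean-Square Error, in the same units as the data."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical daily case counts and forecasts.
actual = [100, 200, 400]
predicted = [110, 180, 400]
print(mape(actual, predicted), rmse(actual, predicted))
```

MAPE is scale-free and weights every day equally in relative terms, while RMSE penalizes large absolute misses more heavily, which is why two models can rank differently under the two measures.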
The results for the prediction period show that, among the three models, the one with the smallest prediction error is the search composite index after filtering and noise reduction, with a MAPE of 4.98% and an RMSE of 192.71. The second best is the search composite index, with a MAPE of 7.09% and an RMSE of 258.97. By contrast, the model that predicts new confirmed cases from the new suspected cases four periods earlier has the worst accuracy and a shorter prediction lead time. Figure 3 shows the original and predicted values of 2019-nCov during the prediction period.
The absolute errors show that the search composite index after filtering and noise reduction gives the highest prediction accuracy, but its predicted trend is flat overall because the high-frequency components have been filtered out, so changes in the peak cannot be captured over a short period. The unfiltered search composite index (Index) compensates for this defect: its predicted values overestimate or underestimate the cases at every turning point, yet it tracks short-term fluctuations accurately. Accordingly, the model based on the filtered composite index can be used to observe the epidemic trend over a longer horizon, and the model based on the search composite index can be used to monitor short-term epidemic fluctuations.

Discussion
Finally, after evaluating the models on the prediction set, we use them to project the epidemic trend over the next 10 days, from February 10 to 19. The results are shown in Figure 4.
The figure shows that the search composite index and the index after filtering and noise reduction can predict 2019-nCov for the coming 10 days, while the new suspected cases can predict it for the coming four days only. The search composite index peaked on January 23, 25 and 28, which basically coincides with the three small peaks of new confirmed cases on February 4, 6 and 9. This reflects the leading nature of the index. The index declined thereafter but reached another search peak on February 6 and 7, which implies that the new confirmed cases of 2019-nCov will also reach a small peak after a continuous decline. The prediction models place this peak on February 16 and 17. The prediction from new suspected cases for February 10-13 is basically consistent with the other two models: the third small peak appears on February 9, after which cases decline for a short period.

Conclusion
Given the current severe situation of 2019-nCov in the world, and particularly in China, this paper uses the Baidu search index to predict the viral pneumonia and proposes three prediction models: (1) a model based on the Baidu search composite index; (2) a model based on the search composite index after filtering and noise reduction; (3) a model based on the daily number of new suspected cases. All three models can be used to predict the new confirmed cases of 2019-nCov in the coming period. Specifically, the first two models can predict new confirmed cases for the coming ten days, while the third can predict them for the coming 4 days. In terms of prediction performance, the search composite index with filtering and noise reduction has the highest accuracy for new confirmed cases but is not sensitive to short-term fluctuations. The model based on the search composite index is sensitive to peaks and valleys, although its prediction error is larger than that of Model (2). The model based on new suspected cases has the worst accuracy. Therefore, Model (2) can be used to predict the future trend of the epidemic, and it can be combined with Model (1) to monitor short-term epidemic fluctuations. By further projecting the new confirmed cases for the ten days beyond the prediction period, this paper predicts that the new confirmed cases of 2019-nCov will reach a small peak on February 16 and 17 and will then decline with oscillations.
The predictive capacity of the current prediction models is limited for 2019-nCov, an emerging infectious disease. This paper only puts forward a prediction approach, and its precision needs further improvement, which also reflects how arduous the task of epidemic forecasting is.

Availability of data and material
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

Competing interests
The authors declare that they have no competing interests.

Funding
This study is supported by the Anhui Natural Science Foundation (grant number 1908085QG305). The funding body had no role in the study design, data collection, data analysis, data interpretation, or writing of the manuscript.

Authors' contributions
LXX and WQ conceived and designed the experiments and analyzed the data. LXX wrote the manuscript. LBF developed and evaluated the model. LXX, WQ and LBF edited and revised the manuscript. All authors read and approved the final manuscript.