On the predictability of COVID-19 in USA: A Google Trends analysis

The COVID-19 pandemic has already had severe consequences in all aspects of our lives, making it imperative to explore novel approaches to monitoring and forecasting regional outbreaks as they happen, or even before they do. This paper presents the first approach to exploring the role of Google query data in the predictability of COVID-19 in the US at both national and state level. The results indicate that Google Trends data correlate with COVID-19 data, while the estimated models exhibit strong predictability of COVID-19. In line with previous work that has argued for the value of online real-time data in the monitoring and forecasting of epidemics and outbreaks, it is evident that such infodemiology approaches can assist public health policy makers in addressing the most crucial issues: flattening the curve, allocating health resources, and increasing the effectiveness and preparedness of the respective health care systems.


Introduction
In December 2019, a novel coronavirus of unknown source was identified in a cluster of patients in the city of Wuhan, in Hubei, China [1]. The outbreak first came to international attention after the World Health Organization (WHO) reported a cluster of pneumonia cases on Twitter on January 4th [2], followed by an official report on the 5th [3]. China reported its first COVID-19 related death on January 11th, while on the 13th, the first case outside China was identified [4]. On January 14th, the WHO tweeted that Chinese preliminary investigations reported that no human-to-human transmission had been identified [5]. However, the virus quickly spread to other Chinese regions and neighboring countries, while Wuhan, which was identified as the epicenter of the outbreak, was cut off by the authorities on January 23rd, 2020 [6]. On January 30th, the WHO declared the epidemic a public health emergency [1], and the disease caused by the virus received its official name, COVID-19, on February 11th [7].
The first serious COVID-19 outbreak in Europe was identified in northern Italy in February, with the country having its first death on the 21st [8]. The novel coronavirus was transmitted to all parts of Europe within the next few weeks, resulting in the WHO declaring COVID-19 a pandemic on March 11th, 2020.
As of April 18th, 2020, 16:48 GMT [9], there have been 2,287,369 confirmed cases worldwide, with 157,468 confirmed deaths, and 585,838 recovered. The most affected countries with more than 100K cases (in absolute numbers, not divided by population) are: USA with 715,105 confirmed cases and 37,889 deaths; Spain with 191,726 confirmed cases and 20,043 deaths; Italy with 175,925 confirmed cases and 23,227 deaths; France with 147,969 confirmed cases and 18,681 deaths; Germany with 142,614 confirmed cases and 4,405 deaths; and the UK with 114,217 confirmed cases and 15,464 deaths, as depicted in Figure 1, which consists of the heat maps for the worldwide cases and deaths by country. As is evident, Europe is severely hit by COVID-19; however, the spread of the disease now indicates that the center of the epidemic has moved to the US, which is the most affected country in terms of cases and deaths, with the state of New York counting more than 240K cases and 17K casualties. Figure 2 shows the distribution of the COVID-19 cases and deaths in the US by state, as of April 18th, 2020.
In the search for new methods and approaches for disease surveillance, it is crucial to make use of real-time internet data. Infodemiology, i.e., information epidemiology, is a concept introduced by Gunther Eysenbach [10][11]. In the field of infodemiology, internet sources and data are employed in order to inform public health and policy [12][13], and are valuable for the monitoring and forecasting of outbreaks and epidemics [14], such as Ebola [15], Zika [16], MERS [17], influenza [18], and measles [19][20].

During this pandemic, several approaches using Web-based data have already been published in this line of research. Google Trends, the most popular infodemiology source along with Twitter, has been widely used in health and medicine for the analysis and forecasting of diseases and epidemics [21]. As of April 20, 2020, seven (7) papers on the topic of tracking and forecasting COVID-19 using Google Trends data have already been published, according to PubMed (advanced search: covid AND google trends) [22], monitoring, analyzing, or forecasting COVID-19 in several regions such as Taiwan [23], China [24][25], Europe [26][27], USA [27][28], and Iran [27][29]. Note that for Twitter publications related to the COVID-19 pandemic, eight (8) papers are online up to this point (PubMed advanced search: covid AND twitter [22]), published from March 13 to April 20, 2020 [30][31][32][33][34][35][36][37]. Table 1 consists of the systematic reporting of COVID-19 Google Trends studies, in the order of the reported publication date.
In this paper, USA Google Trends data on the topic of "Coronavirus (Virus)" are employed at both national and state level, in order to explore the relationship between COVID-19 data and the online interest in the virus. At first, the correlations between Google Trends and COVID-19 data are calculated, followed by exploring the role of Google Trends data in the predictability of COVID-19. To the best of our knowledge, this is the first attempt of this kind in the US.
The rest of the paper is structured as follows: the Methods section details the procedure of the data collection and the statistical analysis tools and methods, the Results section includes the nowcasting models at both national and state level, and the Discussion section consists of the main findings of this work, along with the limitations and future research suggestions.

Methods
Data from the Google Trends platform are retrieved in .csv [38]. Data are normalized over the selected period and Google Trends reports the adjustment procedure as follows: "Search results are normalized to the time and location of a query by the following process: Each data point is divided by the total searches of the geography and time range it represents to compare relative popularity. Otherwise, places with the most search volume would always be ranked highest. The resulting numbers are then scaled on a range of 0 to 100 based on a topic's proportion to all searches on all topics. Different regions that show the same search interest for a term don't always have the same total search volumes" [39]. The methodology for the data collection is designed based on the Google Trends Methodology Framework in Infodemiology and Infoveillance [40]. Note that data may slightly vary based on the time of retrieval.
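The normalization quoted above can be illustrated with a minimal sketch. It assumes hypothetical raw counts (Google Trends does not expose them), and the rule that the largest share is mapped to 100 is an assumption about the scaling step, not a documented implementation; the function name `normalize_trends` is likewise illustrative.

```python
def normalize_trends(topic_counts, total_counts):
    """Illustrative Google-Trends-style normalization (assumed, not official).

    topic_counts: hypothetical searches on the topic per time point
    total_counts: hypothetical total searches per time point
    """
    if len(topic_counts) != len(total_counts):
        raise ValueError("both series must have the same length")
    # Step 1: relative popularity = topic searches / all searches.
    shares = [t / n if n > 0 else 0.0 for t, n in zip(topic_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0:
        return [0] * len(shares)
    # Step 2 (assumption): rescale so that the largest share becomes 100.
    return [round(100 * s / peak) for s in shares]


# Hypothetical example: the third period has more topic searches than the
# first, but the same share of all searches, so both get the same score.
scores = normalize_trends([10, 50, 20], [1000, 1000, 2000])
```

Because each value depends on the peak within the selected window, retrieving a different time frame changes every score, which is why the retrieval period has to match the period of the COVID-19 data.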
For the keyword selection, the online interest in all commonly used variations of referring to the virus is examined and compared, i.e., "Coronavirus (Virus)"; "COVID-19 (Search term)"; "SARS-COV-2 (Search term)"; "2019-nCoV (Search term)"; "Coronavirus (Search term)". Only "Coronavirus (Virus)" and "Coronavirus (Search term)" yield significantly high online interest, which is to be expected. Between the two, i.e., the Topic (Virus) and the Search term, "Coronavirus (Virus)" is selected for further analysis.
Data for the worldwide distribution of the COVID-19 cases and deaths are retrieved from Worldometer [9], and maps of COVID-19 cases and deaths are recreated by the authors using the free online tools Pixelmap [41] and Chartsbin [42]. Data for the US analysis on COVID-19 are retrieved from "The COVID Tracking Project", which provides detailed structured data on COVID-19 cases and deaths nationally and at state level [43].
As Google Trends data are normalized, the time frame for which search traffic data are retrieved should exactly match the period for which COVID-19 data are available. Therefore, the time frames for which the analysis is performed differ across states, starting either on March 4th or on the date on which the first confirmed case was identified in each state, as shown in Table 2.
Each variable used in this study is divided by its full-sample standard deviation, calculated using the basic formula for the standard deviation of a variable. By doing this, the inherent variability of each variable is removed, and thus all of them have a standard deviation equal to 1. This allows us to compare the strength of the impact of the explanatory variables on the dependent variable. The non-parametric [44] unit root test is also applied, in order to reveal whether or not both variables are stationary. The results suggest that both variables can be used directly without further transformation in the present analysis.
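The scaling step can be sketched as follows. This is a minimal illustration that assumes the sample standard deviation (n - 1 denominator); the text does not state which variant was used, and the function name `scale_by_std` is hypothetical.

```python
import statistics


def scale_by_std(values):
    """Divide a series by its full-sample standard deviation.

    Assumption: the sample (n - 1) standard deviation is used.
    """
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("a constant series cannot be scaled")
    return [v / sd for v in values]


# Hypothetical series; after scaling, its standard deviation equals 1.
scaled = scale_by_std([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Note that only the spread is rescaled; the series is not centered, so its mean is divided by the same factor rather than set to zero.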
The first step towards exploring the role of Google Trends in the predictability of COVID-19 is to examine the relationship between Google Trends and COVID-19 incidence. To this end, the Pearson correlation coefficients (r) between the ratio (COVID-19 Deaths)/(COVID-19 Cases) and Google Trends data are calculated. In particular, a minimum variance bias-corrected Pearson correlation coefficient [45][46] via a bootstrap simulation is applied, in order to deal with the limited number of observations, and thus with the small sample estimation bias (also see [46]). The bias-corrected bootstrap coefficient $\hat{r}_{BC}$ for the Pearson correlation is given by:

$$\hat{r}_{BC} = \hat{r} - \left( \frac{1}{B} \sum_{b=1}^{B} \hat{r}^{*}_{b} - \hat{r} \right)$$

where $\hat{r}$ is the full-sample estimate, $\hat{r}^{*}_{b}$ is the estimate obtained from the $b$-th bootstrap sample, and $B$ corresponds to the number of bootstrap samples; in this case set equal to 999.
Next, predictive analysis for the USA and all US states (plus DC) is performed. The predictive model is a quantile regression, introduced by Koenker and Bassett [47] and considered to be robust against the presence of outliers in the sample. Building on the study by Karlsson [46], a quantile regression that is bias-corrected via balanced bootstrapping is employed. Such a model is an appropriate statistical approach to mitigate the small sample estimation bias and the presence of outliers in the dataset, as it combines the advantages of bootstrap standard errors and the merits of quantile regression.
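A bootstrap bias correction of the Pearson coefficient can be sketched as below. This is a minimal illustration under stated assumptions: it uses ordinary resampling of (x, y) pairs with replacement rather than the minimum variance scheme of [45][46], the function names are hypothetical, and the data in the usage lines are invented.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def bias_corrected_pearson(x, y, n_boot=999, seed=0):
    """r_BC = r - (mean of bootstrap r* - r), using pair resampling."""
    rng = random.Random(seed)
    n = len(x)
    r_hat = pearson(x, y)
    boot = []
    while len(boot) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # Skip degenerate resamples where a series is constant
        # (the correlation is undefined there).
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        boot.append(pearson(xs, ys))
    bias = sum(boot) / n_boot - r_hat
    return r_hat - bias


# Invented example: search interest falling while the ratio rises.
trends = [100, 92, 85, 80, 71, 66, 60, 58, 50, 47]
ratio = [1.0, 1.3, 1.2, 1.8, 2.1, 2.0, 2.6, 2.9, 3.1, 3.4]
r_bc = bias_corrected_pearson(trends, ratio)
```

Because the correction subtracts an estimated bias, the corrected value can fall marginally outside [-1, 1] when the raw coefficient is close to the boundary.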
More specifically, let $y_t$, indexed by time $t$, be a time series representing the dependent variable, supposing a bivariate specification. A quantile regression estimates the impact of the explanatory variable $x_t$ on the variable $y_t$ at different points of the conditional $\tau$-quantile, where $\tau \in (0,1)$, of the conditional distribution. A value of $\tau$ close to zero and a value of $\tau$ close to one represent the left (lower) and the right (upper) tail of the conditional distribution, respectively. The conditional quantile function is defined by:

$$Q_{y_t \mid x_t}(\tau) = x_t' \beta_\tau$$

Given the distribution of $y_t$, the estimation of the conditional quantile function coefficients $\beta_\tau$ can be obtained by solving the following minimization problem:

$$\min_{\beta} \; E\left[ \rho_\tau\left( y_t - x_t' \beta \right) \right]$$

where $\rho_\tau(u) = u\left( \tau - 1_{\{u<0\}} \right)$ represents the loss function. By minimizing the sample analog over $\{y_1, \ldots, y_T\}$, which corresponds to a $\tau$-th quantile sample, the estimator $\hat{\beta}_\tau$ takes the form:

$$\hat{\beta}_\tau = \arg\min_{\beta} \sum_{t=1}^{T} \rho_\tau\left( y_t - x_t' \beta \right)$$

where $x_t' \hat{\beta}_\tau$ is an approximation to the conditional $\tau$-quantile of the variable $y_t$. In our analysis, $y_t$ stands for the ratio (COVID-19 Deaths)/(COVID-19 Cases), $x_{t-1}$ is the respective Google Trends value in lag order, and $t = 1, \ldots, T$, with $T$ being the respective number of observations. A linear trend is also used.
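The role of the loss function can be illustrated with a short sketch. It shows the check (pinball) loss $\rho_\tau(u) = u(\tau - 1_{\{u<0\}})$ and, for the simplest case of a model with only a constant, that minimizing the summed loss returns a sample quantile. The function names and the data are hypothetical and serve only as illustration.

```python
def pinball_loss(u, tau):
    """Check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_by_minimization(y, tau):
    """Constant-only quantile regression.

    A minimizer of the summed check loss can always be found among the
    observed values, so it is enough to search over them.
    """
    candidates = sorted(set(y))
    return min(candidates, key=lambda q: sum(pinball_loss(v - q, tau) for v in y))


# Invented sample with one extreme value: the tau = 0.5 solution is the
# median, which the outlier does not move.
sample = [1, 2, 3, 4, 100]
median_fit = quantile_by_minimization(sample, 0.5)
```

With $\tau = 0.5$ the loss is half the absolute error, which is why median regression reacts to the sign of a residual rather than its size and is therefore less sensitive to outliers than least squares.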
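Finally, the bias-corrected parameter estimate is computed as:

$$\hat{\beta}^{BC}_\tau = \hat{\beta}_\tau - \widehat{\mathrm{bias}}\left( \hat{\beta}_\tau \right)$$

where $\widehat{\mathrm{bias}}(\hat{\beta}_\tau)$ is given by $B^{-1} \sum_{b=1}^{B} \hat{\beta}^{*}_{\tau,b} - \hat{\beta}_\tau$, with $\hat{\beta}^{*}_{\tau,b}$ denoting the estimate from the $b$-th bootstrap sample, and $\tau \in (0, 1)$ stands for the quantile considered; in this case set equal to 0.5 (median). A median regression is considered more robust to outliers than, for example, least squares regression, and it also avoids assumptions about the parametric distribution of the error (see [48]).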
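The bias correction of a median regression slope can be sketched as follows. This is a simplified illustration, not the estimation procedure of the study: it fits only an intercept and one slope (no linear trend), uses ordinary pair resampling instead of balanced bootstrapping, and finds the fit by brute force, relying on the fact that a two-parameter quantile regression has an optimal line passing through two data points. All names and data are hypothetical.

```python
import random


def pinball_loss(u, tau):
    """Check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_line(x, y, tau=0.5):
    """Fit y ~ a + b*x by searching all lines through two data points."""
    best = None
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue  # vertical line, slope undefined
            b = (y[j] - y[i]) / (x[j] - x[i])
            a = y[i] - b * x[i]
            loss = sum(pinball_loss(yk - a - b * xk, tau) for xk, yk in zip(x, y))
            if best is None or loss < best[0]:
                best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


def bias_corrected_slope(x, y, tau=0.5, n_boot=199, seed=0):
    """beta_BC = beta_hat - (mean of bootstrap slopes - beta_hat)."""
    rng = random.Random(seed)
    n = len(x)
    _, b_hat = quantile_line(x, y, tau)
    boot = []
    while len(boot) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # degenerate resample
        boot.append(quantile_line(xs, ys, tau)[1])
    bias = sum(boot) / n_boot - b_hat
    return b_hat - bias


# Invented data: lagged search interest (x) against a ratio (y) with one
# outlying observation at the end.
x_lag = [float(v) for v in range(10)]
y_ratio = [2.0 + 3.0 * v for v in x_lag[:-1]] + [1000.0]
slope_bc = bias_corrected_slope(x_lag, y_ratio, n_boot=99)
```

The brute-force search costs on the order of n^3 loss evaluations per fit, which is acceptable only for samples as short as the ones discussed here; a linear-programming solver would be the usual choice for larger data.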

Results
In Figure 3, the worldwide and US online interest in terms of Google queries in the "Coronavirus (Virus)" Topic from January 22nd to April 15th, 2020, is depicted, showing that said topic is very popular, especially in Europe and in North America. In the US, the interest is significantly high (i.e., above 70) for all US states. Next, the correlations between Google Trends and COVID-19 data are calculated. Table 3 consists of the Pearson correlation analysis (see also Figure 4).
Proceeding with the results of the predictive analysis, Table 4 consists of the estimated models for the US and for each US state (plus DC), and Figure 5 depicts the heat map of the models' statistical significance by state. In Table 4, standard errors are reported in parentheses and t-statistics are given in brackets. Due to the low number of observations, the states of Maine, Montana, North Dakota, West Virginia, and Wyoming were not included in the predictive analysis results; they are, however, included in the heat map for uniformity reasons. As is evident, the estimated Google Trends models exhibit strong COVID-19 predictability.

Figure 5.
Heat map of the predictive analysis models' statistical significance.

Discussion
In light of the COVID-19 pandemic and towards finding new ways of forecasting the spread of the disease, in this study Google Trends data on the "Coronavirus (Virus)" Topic were used in order to explore the predictability of COVID-19 in the US. At first, statistically significant correlations were observed for the US and several US states (shown in more detail in the spider web chart of said correlations in Figure 6), which is in line with previous studies that have suggested that correlations are observed between Google Trends and COVID-19 data.
Figure 7 consists of the graph of the COVID-19 Deaths/Cases ratio and the respective Google Trends normalized data in the US from March 4th to April 15th, 2020. For graph consistency purposes, the COVID-19 Deaths/Cases ratio is normalized on a 0-100 scale. As depicted in the graph and also confirmed by the predictive analysis, it is evident that the two variables are not linearly dependent, but rather have an inversely proportional relationship, meaning that as COVID-19 progresses, the online interest decreases. From a behavioral point of view, this can be explained as follows: the interest increased at first and reached a peak as the confirmed cases reached a high number and as death rates started exhibiting the real threat of this pandemic, while after a while the interest followed an inverse course, which could also indicate that the public can be overwhelmed by information overload and turns to decreased information intake. The spike in Google queries and the decline in the ratio of COVID-19 Deaths/Cases could be due to the spreading of the virus over these days and the "delay" in deaths, i.e., cases increasing while the total number of deaths has not yet started increasing significantly.
The latter is in line with the recent publication of Mavragani [26], which suggested that, though significant correlations between COVID-19 and Google data are observed, they tend to decrease both in strength and significance as time moves forward in regions that have been affected by COVID-19, because the interest decreases. Counter-intuitively, this happens before the curves of cases and deaths start exhibiting a downward trend, i.e., when a region is being heavily affected, independently of whether or not it has reached its peak yet. However, it would be interesting to explore the relationship from this point onwards, since, as shown in the graph, the lines meet, which could indicate a future change in the relationship dynamics when deaths peak at a later point and also when they start their downward course.
This study has limitations. First, only data from Google Trends were considered. Though Google is the most popular search engine, data on the topic of Coronavirus from other search engines were not included in this analysis. Second, data at this point are very limited, thus the results are based on few observations. Third, the 50 states and DC exhibit diversity in terms of confirmed cases and deaths, thus any conclusions drawn from this analysis refer to each case individually. Despite the known limitations of online search traffic data, though, using infodemiology metrics to inform public health and policy in general, and to monitor outbreaks and epidemics in particular, has received wide attention recently, with several successful attempts at forecasting the spread of diseases.
Towards exploring the determinants of COVID-19, the predictive analysis in this study gives insight into how online search traffic data can play a significant part in forming public health policies, especially in times of epidemics and outbreaks, when real-time data are essential. With the COVID-19 pandemic, the world is in uncharted territory in scientific, financial, and social terms. This calls for immediate action and open research and data, and the term "multidisciplinary" has never before been more important. To this end, the role of big data in providing "opportunities for performing modeling studies of viral activity and for guiding individual country healthcare policymakers to enhance preparation for the outbreak" has been acknowledged [49], and current research on the subject should focus both on exploring the role of more infodemiology variables and on combining infodemiology with traditional sources, in order to explore the full potential of what online, real-time data have to offer to disease surveillance.