More Active Internet-Search on Google and Twitter Posting for COVID-19 Corresponds with Lower Infection Rate in the 50 U.S. States

As the novel coronavirus disease 2019 (COVID-19) continues to rage worldwide, the United States has become the most affected country with more than 2.5 million total confirmed cases up to now (June 2, 2020). In this work, we investigate the predictive power of online social media and Internet search for the COVID-19 pandemic among the 50 U.S. states. By collecting state-level daily trends from both Twitter and Google Trends, we observe a high but state-dependent lagged correlation with the number of daily confirmed cases. We further find that the predictive accuracy, measured by the correlation coefficient, is positively correlated with a state's demographics, air traffic volume and GDP development. Most importantly, we show that a state's early infection rate is negatively correlated with the lag to the previous peak in Internet search and tweeting about COVID-19, indicating that the earlier the collective awareness on Twitter/Google in a state, the lower the infection rate.

Correlation analysis. The Spearman correlation is computed in this study using Python's SciPy library. Specifically, we conduct lagged correlation analyses to assess the temporal relationships between Internet data and the COVID-19 pandemic. For each state, we right-shift the daily Internet data from Twitter and Google Trends (with different search terms) by a variable lag and calculate the Spearman correlation with the daily reported COVID-19 cases. The maximum lag is set to 40 days. Spearman correlation is also used to examine the correlation between c* and the state's variables, and between c*/l* and the early infection rate, at significance levels from * p < 0.05 to *** p < 0.001.
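The cross-state tests mentioned at the end of the correlation analysis could be sketched as follows. This is only an illustration, not the authors' code: the function names are hypothetical, and the intermediate level ** p < 0.01 is an assumption based on the usual convention, since the text states only the two end points.

```python
from scipy.stats import spearmanr


def significance_stars(p_value):
    """Map a p-value to the star notation used in the tables.

    The text gives * p < 0.05 and *** p < 0.001; the middle level
    (** p < 0.01) is an assumed, conventional choice.
    """
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


def cross_state_spearman(x, y):
    """Spearman correlation between two per-state quantities.

    x and y hold one value per state in the same state order,
    e.g. x = c* of each state and y = population size.
    """
    rho, p_value = spearmanr(x, y)
    return float(rho), float(p_value), significance_stars(p_value)
```

For example, `cross_state_spearman(c_star_by_state, population_by_state)` would return the coefficient, its p-value and the corresponding stars, where both argument names are placeholders for lists ordered by state.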


Introduction
"At every crucial moment, American officials were weeks or months behind the reality of the outbreak. Those delays likely cost tens of thousands of lives". NYT June 26, 2020 [1]
Since the beginning of January 2020, the world has been turned upside down. Nothing has been as it was before since the novel coronavirus disease was first reported in Wuhan, China, in December 2019 [2]. After initial blunders, China took energetic measures to combat the virus (e.g. the Wuhan shutdown) [3] while the Western world was still mostly complacent. Although epidemiologists had already warned at the end of January that COVID-19 would probably turn into a global crisis [4], politicians and the population in the US and Western Europe alike initially ignored the problem. The virus was seen as something far away that, like SARS and the avian flu, would be active mostly in the high-density populations of Asia and then go away. And even when Italy was shaken by a virulent COVID-19 outbreak in February [5], which closed down the northern industrial heartland of Veneto, the US authorities were still mostly ignoring the problem [6]. Only when New York started seeing soaring infection rates in mid-March did the population and the politicians start taking the disease seriously. This behavior is clearly reflected in the Google search trend and the Twitter activity, motivating our research questions: Is a state or political entity better capable of dealing with an infectious disease if the collective awareness is raised early on in the course of the disease? Does a population that actively searches for information about COVID-19 and maintains a robust dialog on Twitter about this topic deal more efficiently with the disease?
It has been shown that data from online social media and Internet searches are correlated with several previous epidemics, such as seasonal influenza epidemics [7], Dengue [8], MERS [9] and H1N1 [10]. Regarding COVID-19, several works [11][12][13][14] have demonstrated significant correlation between Internet search and the spread of the pandemic among different countries. However, it still remains unclear which regional factors the Internet's predictive abilities relate to, and whether they are useful for surveilling the spread of the disease. If there is indeed predictive power in the Google search and tweeting behavior of a US polity such as a state or a city, it will give invaluable input to policymakers, governments, and healthcare providers to better prepare for and deal with potential future waves of COVID-19 and other epidemics.
In this work, using data from the 50 states of the United States, we conduct a comparative study of the role of online social media and search trends in the COVID-19 epidemic. We show that the daily number of COVID-19 related tweets on Twitter exhibits a strong but state-dependent lagged correlation with newly confirmed cases. The same can be observed for the Google Trends index using coronavirus-related search terms. These state differences in predictive capability, in terms of correlation strength and lag, are closely related to a state's demographics, quantitatively measured by a state's population size and density, air traffic volume and economic development. Further, our analysis of the state-level early COVID-19 incidence demonstrates a significantly negative correlation between the lag and the early infection rate, implying that an actively engaged population that searches for information and tweets about COVID-19 further ahead of the outbreak is associated with a lower infection rate.

Predictive Power of Google Trends and Twitter
We focus on COVID-19 infections, Twitter tweets and Google search data for all 50 U.S. states, excluding Puerto Rico and the District of Columbia since some Internet data for these two regions are unavailable.
Specifically, we collect the state-level daily COVID-19 confirmed cases from the New York Times. The number of COVID-19 related tweets in each individual state is extracted from an open COVID-19 Twitter chatter dataset [15]. We obtain the Google Trends index by using a combination of one of three keywords ('coronavirus', 'COVID' and 'COVID19') and a state's full name as an integrated search term (e.g. 'coronavirus Massachusetts', 'COVID California'), given that residents are usually more concerned about their local situation of the pandemic. More details of the data used in this study are described in the Methods section.
In Fig. 1, we compare the daily confirmed cases, the number of COVID-19 related tweets and the Google Trends indexes (with different search terms) in New York, Massachusetts, Iowa and California. One can observe that the overall patterns differ between states. We then investigate the relationship between the spread of the COVID-19 pandemic and the Internet data in all 50 U.S. states. Fig. 2 shows the lagged Spearman correlation between the Internet data from Twitter and Google Trends and the reported COVID-19 cases for the 4 selected states. To quantify the predictive power of the tweeting behavior and the search activity for an individual state, we denote c* as the highest correlation coefficient and l* as the optimal lag achieving c*. In principle, a larger c* indicates a higher accuracy in predicting the state-specific pandemic. A larger l* corresponds to an earlier peak of Internet searches and tweeting about COVID-19, indicating that residents start being active on the Internet earlier. We find that c* and l* are quite different among different states and between Google Trends and Twitter (see Fig. 2). For instance, for New York, c* is merely 0.60 with l* = 15 using the Twitter data but is up to 0.95 with l* = 19 for tracking 'coronavirus New York' on Google Trends. For California, the c* of Twitter and of 'coronavirus California' on Google Trends are 0.67 and 0.81, respectively, while the l* for both is above 30 days.
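As a rough illustration of how c* and l* can be obtained for one state, the following minimal sketch shifts the Internet series against the case series and keeps the lag with the highest Spearman coefficient. It assumes both series are daily, cover the same dates and have equal length; the function name and the minimum-overlap guard are hypothetical choices, not details taken from the study.

```python
import numpy as np
from scipy.stats import spearmanr


def best_lagged_spearman(internet, cases, max_lag=40):
    """Return (c_star, l_star) for one state.

    internet, cases: equally long daily series aligned on the same dates.
    A lag of k days pairs internet[t - k] with cases[t], i.e. the Internet
    series is right-shifted by k days before correlating.
    """
    internet = np.asarray(internet, dtype=float)
    cases = np.asarray(cases, dtype=float)
    if internet.shape != cases.shape:
        raise ValueError("series must have the same length")

    c_star, l_star = -np.inf, 0
    for lag in range(max_lag + 1):
        x = internet[: len(internet) - lag]  # Internet activity, lag days earlier
        y = cases[lag:]                      # cases on the later days
        if len(x) < 3:                       # assumed guard: too little overlap
            break
        rho, _ = spearmanr(x, y)
        if np.isfinite(rho) and rho > c_star:
            c_star, l_star = float(rho), lag
    return c_star, l_star
```

Calling `best_lagged_spearman(tweets_per_day, new_cases_per_day)` on a state's two series (placeholder names) would give the pair reported per state; the default `max_lag=40` mirrors the maximum lag stated in the correlation analysis.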

Correlation between c* and state conditions
We find that the wide difference of c* among the 50 states is partially related to a state's economic and social conditions. Specifically, we consider population demographics, air traffic flow and the economic development level, which can be quantitatively characterized by the following proxies. A state's population size as of 2019 is estimated by the U.S. Census Bureau, along with the population density measured by the number of residents per square mile. The air traffic flow is measured by enplanement (i.e., the number of passengers boarding) in 2017 and 2018 (see details in the Methods section). In addition, we collect each state's gross domestic product (GDP) as well as the GDP per capita as of the 4th quarter of 2019 to measure economic output.
We calculate the Spearman correlation coefficient between these six variables and the c* of Twitter volume and Google Trends index, finding a significantly positive correlation, as shown in Table 1. In particular, the larger the population, the higher the population density, and the higher the air traffic and wealth of a state, the more accurately Twitter and Google Trends predict the COVID-19 pandemic. This makes intuitive sense, as higher income is correlated with higher education, and higher geographic mobility leads to more information exchange, both raising early awareness of the pandemic. There is no significant correlation between l* and the states' demographic variables.

Correlation between early infection rate and c*/l*
We further examine the effect of an actively engaged population on the outbreak of the infection. Specifically, we focus on the early stage of the COVID-19 outbreak in the 50 U.S. states, a period when the government had not yet started to take serious control measures. The infection rate in this stage is a reasonable proxy for the extent to which a state's residents rely on their individual awareness to protect themselves against the pandemic. Quantitatively, we define the early infection rate as the proportion of residents infected in the earliest days after the state-level first case was confirmed (see the distribution of the early infection rate among the 50 states in Supplementary Figure 2).
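The early infection rate defined above can be written down in a few lines. The sketch below is a hedged illustration: it assumes the input holds daily new (not cumulative) case counts starting on the date of the state's first confirmed case, and the function and parameter names are hypothetical.

```python
def early_infection_rate(daily_new_cases, population, window_days):
    """Share of residents infected during the first `window_days` days.

    daily_new_cases: daily new confirmed cases, where index 0 is the date
        of the state's first confirmed case (assumed to be new, not
        cumulative, counts).
    population: number of residents of the state.
    window_days: length of the early period in days, e.g. 7, 14 or 21.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return sum(daily_new_cases[:window_days]) / population
```

With illustrative numbers, `early_infection_rate([1, 0, 2, 5, 10, 20, 40, 80], 1000, 7)` sums the first seven daily counts (78 cases) and divides by 1000 residents.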
Having both the predictive capacity of Internet search and Twitter data and the early infection rate, we can examine the relationship between the two. Surprisingly, we discover a strong negative correlation between l* and the early infection rate, with the length of the early period varying from 1 week to 3 weeks, as shown in Table 2. This relationship indicates that the earlier people start tweeting and searching, the lower the infection rate. In other words, the earlier the collective awareness on Twitter and on Google search, the fewer people get infected when the virus breaks out. Moreover, we also find a significantly negative correlation between c* and the infection rate using the Twitter data and Google Trends for the terms 'COVID' and 'COVID19' for selected lengths of the early period (see Table 2), implying that the more predictive the proactive Internet-search behavior is, the lower the initial infection rate.

Conclusion
In conclusion, this study showed that there is a high but state-dependent correlation between Google searches and tweets using COVID-19 related keywords and the number of confirmed COVID-19 cases among the 50 U.S. states. These significant correlations occur as early as 27 days before confirmation of the infections, indicating the usefulness of Internet search and online social media tracking for surveilling the pandemic's outbreak locally. We further found that the differences in predictive power between these states are closely related to a state's demographics, characterized by population size and density, air traffic and economic development. Most importantly, we discovered that if there is an actively tweeting population that leads a vibrant dialog on Twitter about COVID-19, the early infection rate will be lower.
Similarly, the more ahead of the outbreak a population starts googling for COVID-19 information, the lower the early infection rate.
Methods
COVID-19 cases in U.S. We collect the COVID-19 confirmed cases from the New York Times (https://www.nytimes.com/), based on reports from state and local health agencies. The daily numbers of cases for the 50 U.S. states are used in this study. For each state, the study period is from the date of the first confirmed case in this state to June 2, 2020.
Twitter data. The COVID-19 tweets on Twitter are acquired from an open COVID-19 Twitter chatter dataset [15], which is a collection of the identifiers of tweets specifically using coronavirus-related keywords (coronavirus, 2019nCoV, COVD19, CoronavirusPandemic, CoronaOutbreak, etc.), starting from January 27, 2020. After hydrating the full JSON objects from these tweets' identifiers, we extract the daily number of tweets in the U.S. at state level according to a tweet's location. Specifically, we first identify all geo-located tweets (i.e., tweets associated with a geographic place), retaining only tweets with a location in the US. Then we assign a tweet to a state using its specific location, such as city and town (see the heatmap of the number of available geo-located tweets in the 50 U.S. states in Supplementary Figure 1).
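The state assignment described above might look roughly like the following sketch. It is an assumption-laden illustration rather than the authors' pipeline: it relies on the `place` object of hydrated Twitter API v1.1 tweets (`country_code`, `full_name`) and only handles city-level places whose `full_name` ends in a two-letter state code such as "Boston, MA". The helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Postal codes of the 50 U.S. states (DC and territories excluded).
US_STATE_CODES = frozenset(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN "
    "MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA "
    "WA WV WI WY".split()
)

# Timestamp format of the `created_at` field in v1.1 tweet objects.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def state_of(tweet):
    """Return the two-letter state code of a geo-located US tweet, else None."""
    place = tweet.get("place")
    if not place or place.get("country_code") != "US":
        return None
    # Assumed rule: city-level places are named like "Boston, MA".
    parts = place.get("full_name", "").rsplit(", ", 1)
    if len(parts) == 2 and parts[1] in US_STATE_CODES:
        return parts[1]
    return None


def daily_state_counts(tweets):
    """Count tweets per (state code, calendar date) from hydrated tweet dicts."""
    counts = Counter()
    for tweet in tweets:
        state = state_of(tweet)
        if state is None:
            continue  # not geo-located, not in the US, or state not resolvable
        day = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT).date()
        counts[(state, day)] += 1
    return counts
```

Tweets without a resolvable state are simply skipped here; a fuller implementation would also need a rule for state-level places and for place names that do not follow the "City, ST" pattern.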
Proxy of air traffic flow. Using the Air Carrier Activity Information System database (https://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/collection/), we obtain the enplanement data at every commercial service airport in the U.S. for 2017 and 2018. As a proxy of a state's air traffic flow, we calculate the sum of the enplanements of all airports located in a state.
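The aggregation from airports to states is a simple grouped sum. A minimal sketch is given below, assuming each airport record is a dictionary with hypothetical keys `state` and `enplanements`; the actual column names in the FAA files may differ.

```python
from collections import defaultdict


def state_enplanements(airport_rows):
    """Sum airport-level enplanements into one total per state.

    airport_rows: iterable of dicts with the assumed keys
        "state" (two-letter code) and "enplanements" (passengers boarding
        at that airport in one year).
    """
    totals = defaultdict(int)
    for row in airport_rows:
        totals[row["state"]] += row["enplanements"]
    return dict(totals)
```

Running the function once on the 2017 records and once on the 2018 records would yield the two enplanement variables per state.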

Declarations
Code availability

Figure 2
Illustration of lagged correlation between new confirmed COVID-19 infections and data from Google Trends and Twitter in the 4 selected states.

Figure 3
Distribution of c* and l* over the 50 states for (a) Twitter and (b-d) Google Trends with different keywords.