- Predictive Power of Google Trends and Twitter
We focus on COVID–19 infections, tweets and Google search data for all 50 U.S. states, excluding Puerto Rico and the District of Columbia since some Internet data for these two regions are unavailable. Specifically, we collect the state-level daily COVID–19 confirmed cases from the New York Times. The number of COVID–19 related tweets in each individual state is extracted from an open COVID–19 Twitter chatter dataset [15]. We obtain the Google Trends index by combining one of three keywords (‘coronavirus’, ‘COVID’ and ‘COVID19’) with a state’s full name into a single search term (e.g. ‘coronavirus Massachusetts’, ‘COVID California’), given that residents are usually more concerned about the local situation of the pandemic. More details of the data used in this study are described in the Methods section.
In Fig. 1, we compare the daily confirmed cases, the number of COVID–19 related tweets and the Google Trends indexes (with different search terms) in New York, Massachusetts, Iowa and California. The overall patterns differ between states. We then investigate the relationship between the spread of the COVID–19 pandemic and the Internet data in all 50 U.S. states. Fig. 2 shows the lagged Spearman correlation between the Internet data from Twitter and Google Trends and the reported COVID–19 cases for the four selected states. To quantify the predictive power of tweeting behavior and search activity for an individual state, we denote by 𝑐∗ the highest correlation coefficient and by 𝑙∗ the optimal lag at which 𝑐∗ is achieved. In principle, a larger 𝑐∗ indicates higher accuracy in predicting the state-specific pandemic. A larger 𝑙∗ corresponds to an earlier peak of Internet searches and tweeting about COVID–19, indicating that residents became active on the Internet earlier. We find that 𝑐∗ and 𝑙∗ differ considerably among states and between Google Trends and Twitter (see Fig. 2). For instance, for New York, 𝑐∗ is only 0.60 with 𝑙∗ = 15 using the Twitter data but reaches 0.95 with 𝑙∗ = 19 for ‘coronavirus New York’ on Google Trends. For California, the 𝑐∗ values for Twitter and for ‘coronavirus California’ on Google Trends are 0.67 and 0.81, respectively, while 𝑙∗ is above 30 days for both.
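The definitions of 𝑐∗ and 𝑙∗ can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`ranks`, `spearman`, `best_lag`), the 40-day maximum lag, and the convention that lag 𝑙 pairs Internet activity on day 𝑡 with confirmed cases on day 𝑡 + 𝑙 are all assumptions. The rank correlation is written out with plain Python so the sketch has no dependencies; in practice `scipy.stats.spearmanr` would be the usual choice.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / (var_x * var_y) ** 0.5


def best_lag(signal, cases, max_lag=40):
    """Return (c_star, l_star) for two equally long daily series.

    Assumed convention: lag l compares signal[t] with cases[t + l], so a
    larger l_star means the Internet signal leads the case counts by more days.
    max_lag=40 is an illustrative choice, not a value taken from the study.
    """
    c_star, l_star = float("-inf"), None
    for lag in range(max_lag + 1):
        x = signal[:len(signal) - lag]  # Internet activity on day t
        y = cases[lag:]                 # confirmed cases on day t + lag
        rho = spearman(x, y)
        if rho > c_star:                # NaN never compares greater, so it is skipped
            c_star, l_star = rho, lag
    return c_star, l_star
```

Calling `best_lag(tweets, cases)` on one state's tweet counts and case counts would return that state's (𝑐∗, 𝑙∗) pair under these assumptions.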
Fig. 3 presents the distribution of 𝑐∗ and 𝑙∗ for Twitter and Google Trends across all 50 U.S. states. The average 𝑐∗ from Twitter is 0.64, while for Google Trends with the keyword ‘COVID’ it is nearly 0.70. These results imply that tweeting activity and search interest are indeed able to predict the spread of COVID–19. On the other hand, the average 𝑙∗ on Twitter is about 26 days, revealing a smaller delay of the Twitter platform. Indeed, we find that 𝑐∗ and 𝑙∗ on Twitter are significantly correlated with 𝑝<0.001 (see the correlation coefficient between 𝑐∗ and 𝑙∗ in Supplementary Table 1), meaning that earlier collective tweeting may result in more accurate prediction. For Google Trends, the average 𝑙∗ of the keyword ‘coronavirus’ (27.0) is somewhat larger than that of both ‘COVID’ (21.3) and ‘COVID–19’ (24.5). One explanation could be that most people searched using the word ‘coronavirus’, since the pandemic was initially reported under this name, while the names ‘COVID’ and ‘COVID–19’ were formally proposed by the World Health Organization at the end of February 2020.
- Correlation between 𝒄∗ and state conditions
We find that the wide variation of 𝑐∗ among the 50 states is partially related to a state’s economic and social conditions. Specifically, we consider population demographics, air traffic flow and the level of economic development, which can be quantitatively characterized by the following proxies. A state’s population size as of 2019 is estimated by the U.S. Census Bureau, along with the population density, measured as the number of residents per square mile. Air traffic flow is measured by enplanements (i.e., the number of passengers boarding) in 2017 and 2018 (see details in the Methods section). In addition, we collect each state’s gross domestic product (GDP) and GDP per capita as of the fourth quarter of 2019 to measure economic output.
We calculate the Spearman correlation coefficient between these six variables and the 𝑐∗ of the Twitter volume and the Google Trends index, finding significantly positive correlations in most cases, as shown in Table 1. In particular, the larger a state’s population, population density, air traffic and wealth, the more accurately Twitter and Google Trends predict the COVID–19 pandemic. This makes intuitive sense, as higher income is correlated with higher education, and higher geographic mobility leads to greater information exchange, both raising early awareness of the pandemic. There is no significant correlation between 𝑙∗ and the states’ demographic variables.
| c* | Twitter | Google Trends (coronavirus) | Google Trends (COVID) | Google Trends (COVID-19) |
|---|---|---|---|---|
| Population size (2019) | 0.505*** | 0.210 | 0.340* | 0.573*** |
| Population density (2019) | 0.374** | 0.302* | 0.414** | 0.473*** |
| Enplanements (2018) | 0.303* | 0.355* | 0.416** | 0.609*** |
| Enplanements (2017) | 0.301* | 0.360* | 0.421** | 0.610*** |
| GDP (2019 Q4) | 0.535*** | 0.229 | 0.374** | 0.599*** |
| GDP per capita (2019 Q4) | 0.244 | 0.379** | 0.517*** | 0.432** |
Table 1. Spearman correlation coefficients between c* and state variables describing population demographics, air traffic flow and the level of economic development (N = 50). Stars denote the significance level: * 𝑝<0.05, ** 𝑝<0.01, *** 𝑝<0.001.
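The source does not state how the 𝑝-values behind the stars were computed. One common approximation, assumed here purely for illustration, converts a Spearman coefficient 𝑟ₛ into a 𝑡 statistic with 𝑁 − 2 degrees of freedom, which in turn gives the smallest |𝑟ₛ| that reaches a given two-sided significance level:

```latex
t = r_s \sqrt{\frac{N - 2}{1 - r_s^{2}}}
\qquad\Longrightarrow\qquad
|r_s| > \frac{t_{\mathrm{crit}}}{\sqrt{t_{\mathrm{crit}}^{2} + N - 2}}
```

With 𝑁 = 50 (48 degrees of freedom), the two-sided critical 𝑡 values of about 2.01, 2.68 and 3.51 correspond to |𝑟ₛ| thresholds of roughly 0.28, 0.36 and 0.45 for 𝑝<0.05, 𝑝<0.01 and 𝑝<0.001, respectively. Under this assumption, a coefficient such as 0.244 falls short of the first threshold, while 0.505 clears the third.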
- Correlation between early infection rate and 𝒄∗/𝒍∗
We further examine the effect of an actively engaged population on the outbreak of the infection. Specifically, we focus on the early stage of the COVID–19 outbreak in the 50 U.S. states, a period when the government had not yet started to take serious control measures. The infection rate in this stage is a reasonable proxy for the extent to which a state’s residents rely on their individual awareness to protect themselves against the pandemic. Quantitatively, we define the early infection rate as the proportion of residents infected in the first 𝑇 days after the state’s first case was confirmed (see the distribution of the early infection rate among the 50 states in Supplementary Figure 2).
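As a concrete reading of this definition, the sketch below computes the early infection rate from a single state's case series. It is an illustration under stated assumptions rather than the authors' procedure: the function name `early_infection_rate` is hypothetical, the input is assumed to be a cumulative count of confirmed cases in date order, and the day of the first confirmed case is counted as day 1 of the 𝑇-day window.

```python
def early_infection_rate(cumulative_cases, population, T):
    """Share of residents confirmed infected within the first T days.

    cumulative_cases: cumulative confirmed counts, one entry per day, in date
        order (assumed input format).
    population: number of residents in the state.
    T: window length in days; the day of the first confirmed case is day 1
        (an assumed convention).
    """
    # Index of the first day with at least one confirmed case.
    first = next((i for i, c in enumerate(cumulative_cases) if c > 0), None)
    if first is None:
        return 0.0  # no confirmed case in the series
    # Last day inside the window; truncated if the series ends earlier.
    last = min(first + T - 1, len(cumulative_cases) - 1)
    return cumulative_cases[last] / population


# Toy series: the first case appears on the third day.
series = [0, 0, 1, 3, 6, 10, 15, 21, 28, 36]
rate = early_infection_rate(series, population=1000, T=7)
```

Evaluating the function for 𝑇 = 7, 14 and 21 for each state gives the three rates that are compared with 𝑙∗ and 𝑐∗.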
Having both the predictive capacity of the Internet search and Twitter data and the early infection rate, we can examine the relationship between the two. Surprisingly, we discover a strong negative correlation between 𝑙∗ and the early infection rate, with 𝑇 varying from 1 week to 3 weeks, as shown in Table 2. This relationship indicates that the earlier people start tweeting and searching, the lower the infection rate. In other words, the earlier collective awareness emerges on Twitter and Google search, the fewer people are infected when the outbreak begins. Moreover, we also find a significantly negative correlation between 𝑐∗ and the infection rate using the Twitter data and Google Trends for the terms ‘COVID’ and ‘COVID-19’ for selected values of 𝑇 (see Table 2), implying that the more predictive the proactive Internet-search behavior is, the lower the early infection rate.
Table 2. Correlation coefficients between the early infection rate for different T (number of days) and the l* and c* obtained from Internet data (N = 50). As in Table 1, stars denote the significance level.
| | | Early infection rate, T=7 | Early infection rate, T=14 | Early infection rate, T=21 |
|---|---|---|---|---|
| l* | Twitter | -0.371** | -0.405** | -0.416** |
| | Google Trends (coronavirus) | -0.471*** | -0.517*** | -0.500*** |
| | Google Trends (COVID) | -0.473*** | -0.517*** | -0.505*** |
| | Google Trends (COVID-19) | -0.445** | -0.516*** | -0.522*** |
| c* | Twitter | -0.566*** | -0.593*** | -0.476*** |
| | Google Trends (coronavirus) | -0.139 | -0.08 | -0.100 |
| | Google Trends (COVID) | -0.371** | -0.374** | -0.167 |
| | Google Trends (COVID-19) | -0.543*** | -0.510*** | -0.270 |