An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic



We present an openly available dataset to facilitate researchers’ exploration of popular discourse about the COVID-19 pandemic. The dataset, whose collection is ongoing, currently consists of over 780 million tweets from all over the world, in multiple languages. The tweets start on 22 January 2020, when fewer than 600 COVID-19 cases had been reported worldwide. The dataset was collected using the Twitter API and by rehydrating tweets from another openly available dataset. To make the data easier for other researchers to use, the English-language tweets have been augmented using state-of-the-art Twitter sentiment and named entity recognition algorithms. The dataset and the summary files we provide allow researchers to avoid some computationally intensive analyses, facilitating more widespread use of social media data to gain insight into issues such as (mis)information diffusion, semantic networks, sentiment, and the evolution of COVID-19 discussions. The insights extracted from such analyses could help inform policy and advocacy work during the current pandemic and future ones.
