Using social media data to assess the impact of COVID-19 on mental health in China

The outbreak and rapid spread of COVID-19 not only harmed physical health but also brought about mental health problems among the public. To assess the causal impact of COVID-19 on psychological changes in China, we constructed a city-level panel data set based on the sentiment expressed in 13 million geotagged tweets on Sina Weibo, China's largest microblog platform. Applying a difference-in-differences approach, we found a significant deterioration in mental health status after the occurrence of COVID-19. We also observed that this psychological effect faded out over time during our study period and was more pronounced among women, teenagers and older adults. The mental health impact was more likely to be observed in cities with low levels of initial mental health status, economic development, medical resources, and social security. Our findings may contribute to the understanding and control of COVID-19's mental health impact.


The epidemic of coronavirus disease 2019 (COVID-19) has become a severe public health crisis [1]. In addition to the adverse impact on physical health, the outbreak and rapid spread of COVID-19 have also brought about mental health problems among the public, such as anxiety and depression [2-4]. To capture the psychological problems during the COVID-19 epidemic, online questionnaires and surveys are widely used in ongoing studies [5-9]. Researchers detect the symptoms of mental illness and identify risk factors by asking participants to answer well-designed questions and report their characteristics. The challenge of these traditional methods is that it is difficult to monitor mental health conditions in real time and understand their dynamic changes [10,11]. The large-scale, real-time data generated by the widespread use of social media provide a way to overcome these problems. By applying Natural [...] the vulnerable groups.

In the second heterogeneity analysis, we investigate whether the psychological effect of COVID-19 varies across different types of cities. We first collected socio-economic statistics reported in the 2019 China City Statistical Yearbook [35] for the cities in our data, such as regional GDP and the number of hospitals (see 'Data' in Methods). We measured the initial mental health status by the median sentiment value of tweets posted in each city during the first week of our study period. Our data were then partitioned into High and Low groups based on the median value of each factor. For example, a city whose regional GDP is lower than the median GDP falls into the low GDP group; otherwise it falls into the high GDP group. The psychological effect was estimated separately using equation (1) on the data in each subgroup.
We expect that the deterioration of mental health after COVID-19's occurrence is more likely to be observed in cities with low levels of economic development, medical resources, and social security, since these areas have limited financial, material and human support for fighting the epidemic and providing mental health services. Our conjecture is confirmed in Fig. 5a-c: the negative effect is more notable in the low group. In Fig. 5d, we find that cities with poor initial mental health status are more susceptible to the psychological impact of COVID-19, so more related measures should be taken in these areas after the occurrence of the epidemic.
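The median split used in this heterogeneity analysis can be illustrated with a minimal sketch. The city names and GDP figures below are hypothetical placeholders, not values from the yearbook, and `split_by_median` is an illustrative helper rather than the authors' code. It follows the rule stated in the text: a city strictly below the median goes to the Low group, otherwise to the High group.

```python
from statistics import median


def split_by_median(values):
    """Label each city 'Low' if its value is below the cross-city median, else 'High'."""
    cutoff = median(values.values())
    return {city: ("Low" if v < cutoff else "High") for city, v in values.items()}


# Hypothetical regional GDP figures (arbitrary units) for four illustrative cities.
gdp = {"A": 120.0, "B": 450.0, "C": 80.0, "D": 300.0}
groups = split_by_median(gdp)

for city, label in groups.items():
    print(city, label)
```

Equation (1) would then be estimated once on the Low subset and once on the High subset, and the two coefficients compared.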

In addition to the physical harm, the outbreak and rapid spread of COVID-19 have caused some additional [...] the influence of this unprecedented event, we need to quantify these additional effects, and this paper is an essential component. Our findings could contribute to answering three research questions related to COVID-19's mental health impact.

First, does COVID-19 have a causal effect on the psychological changes reflected on social media in China? Applying a DiD approach to a comprehensive panel data set, our analyses reveal a deterioration in mental health status caused by the occurrence of COVID-19 among users of Sina Weibo, the Chinese equivalent of Twitter. This finding holds across a set of robustness checks. However, the mental health measure is derived from people who post tweets on social media. Although this group contains a large number of people, we acknowledge that it is not randomly drawn from the full population. Young children and the very old are less likely to use Sina Weibo [18,36], and these individuals may in fact be more vulnerable to COVID-19's psychological effect [37,38]. Therefore, our results may underestimate the overall adverse effect of the COVID-19 epidemic on the mental health status of a representative sample of the full population.

This finding also provides new evidence that the sentiment expressed by Chinese social media users could provide a real-time spatiotemporal indicator of how the public's psychological status changes during the epidemic. Because of embarrassment, poor recognition of mental illness, low perceived need for treatment, and limited knowledge of available services, a large number of people with mental health problems have not been detected in China [39].
Under these circumstances, monitoring the psychological response on social media in real time and then providing timely mental health services is an effective approach for governments and policymakers. For example, a social media platform could readily evaluate a user's mental health status by sentiment analysis and take the initiative [...]

[...] over time? The results of our relative time model show that the effect of COVID-19 on mental health is likely to fade out during our study period. However, our results do not allow us to conclude that the psychological effect will disappear in the long term, even though the epidemic in China has been almost controlled. The end of the COVID-19 epidemic may not mean the disappearance of its effect on mental health among the public. The socio-economic effects caused by COVID-19, such as economic recession and social inequalities, are also harmful to mental health in the post-epidemic era and might last for a long period of time [40]. Besides, some groups, such as students, may have difficulties adjusting back to normal life when the epidemic is over [41]. For example, during the COVID-19 epidemic, students have had to adapt to online study; once schools reopen, they have to readjust to traditional classes. Such frequent shifts in lifestyle could bring about further psychological problems. The assessment of these subsequent impacts on mental health may be complex and needs further rigorous analysis.

Third, does the effect of COVID-19 on mental health vary across different population groups and cities? Our first heterogeneity analysis shows that the psychological effect is more pronounced among women, teenagers (younger than 18 years old) and older adults (older than 45 years old). Thus, we should pay more attention to these vulnerable people when providing mental health services.
Nevertheless, we are unable to capture the heterogeneous effects on young children and the very old, because of the limited age distribution of Weibo users [18,36]. Traditional questionnaires and surveys may be better methods for investigating the psychological impact on these population groups. The results of the second heterogeneity analysis imply that COVID-19's mental health impact is more likely to be observed in cities with low levels of initial mental health status, economic development, medical resources, and social security. People with poor mental health status before COVID-19 and those living in underdeveloped areas that lack financial, material and human support could therefore suffer more serious mental health problems. This finding may help governments set priorities in decision making. For example, when allocating public resources and providing mental health support, giving priority to these high-risk areas may make the inputs more beneficial. The heterogeneity analysis also highlights the important role of economic conditions, medical resources, and social security in mitigating the negative psychological effect.

We conclude this paper by pointing out several directions for future research. First, we focus only on the text of tweets. However, some tweets contain other types of valuable data, such as pictures and videos, which provide rich information [42]. Further studies are needed to extract sentiment from them and use these data to measure the psychological response more accurately. Second, poor mental health status could lead to severe subsequent consequences, such as suicidal behaviour [43]. This suggests the need to collect related data to quantify the causal impact of COVID-19 on these adverse outcomes. Third, the outbreak of COVID-19 simultaneously brought about an infodemic [44].
The rapid spread of misinformation through social media platforms may also affect mental health, and assessing this phenomenon is a meaningful task [45]. We believe that our findings, together with future research, will assist the understanding of COVID-19's mental health impact and yield useful insights into how to make effective psychological interventions in this kind of sudden public health event.
[...] largest microblog platform in China. Large-scale data access is difficult for Weibo because of the limitations of its application programming interface (API) [46]. Our Weibo data were obtained from a pool of 20 million active users [25], which was selected from over 250 million Weibo users generated by snowball sampling. We collected all geotagged tweets of these active users between January 1, 2020 and March 1, 2020. A geotagged tweet is one for which the user shares location information, recorded as an exact latitude and longitude, when posting. We then selected the 13 million geotagged tweets posted in mainland China during our study period, together with the gender and age information of their users.

Using these data, we conducted our sentiment analysis by applying the SKEP model [17] from Baidu Senta (an open-source Python library) published in 2020, which integrates sentiment knowledge into pre-trained models and achieved new state-of-the-art results on most of the test datasets. For each tweet, the sentiment analysis returns two probabilities representing the intensity of the positive and negative emotions in the text; the two probabilities sum to 1. In this study, we used the positive probability as a measure of the user's mental health status at the time the tweet was posted. The daily mental health status of a city is measured as the median positive probability for that city on each day [18]. This city-level mental health status ranges from 0 to 1, with 0 indicating a strongly negative emotion and 1 indicating a strongly positive emotion. We also calculated the mean of the positive probabilities and used it to measure city-level mental health status in our robustness check.

COVID-19 epidemic data. In this paper, the treatment group was defined as cities that have reported the first COVID-19 case.
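The city-level aggregation described in the sentiment analysis paragraph above (median positive probability per city per day, with the mean as a robustness variant) can be sketched as follows. The tweet records are hypothetical, and the positive probabilities are assumed to have been produced already by a sentiment model; the sketch does not call Senta or any other sentiment library.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records: (city, date, positive probability returned by a sentiment model).
tweets = [
    ("Beijing", "2020-01-20", 0.90),
    ("Beijing", "2020-01-20", 0.20),
    ("Beijing", "2020-01-20", 0.70),
    ("Shanghai", "2020-01-20", 0.40),
    ("Shanghai", "2020-01-20", 0.60),
]


def city_day_status(rows, agg=median):
    """Aggregate tweet-level positive probabilities into one value per (city, date)."""
    buckets = defaultdict(list)
    for city, day, p_pos in rows:
        buckets[(city, day)].append(p_pos)
    return {key: agg(vals) for key, vals in buckets.items()}


status_median = city_day_status(tweets)            # main measure described in the text
status_mean = city_day_status(tweets, agg=mean)    # variant used as a robustness check

print(status_median)
print(status_mean)
```

The median is less sensitive than the mean to a few extremely positive or negative tweets on a given day, which is one plausible reason to prefer it as the main measure.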
We collected the date of the first confirmed case in each city from the official [...] but the pathogen was unknown at first and human-to-human transmission was not verified. The situation in Wuhan therefore differs from that of the other cities in the treatment group, and we excluded Wuhan from our [...]

where $Y_{it}$ represents the mental health status in city $i$ on date $t$, measured from the social media data. $COVID\_19_{it}$ denotes whether the COVID-19 epidemic has occurred in city $i$ on date $t$, and takes the value 1 if the city has reported its first COVID-19 case and 0 otherwise. $X_{it}$ are the control variables, including AQI, mean temperature, mean temperature squared, rainfall, wind speed and cloud cover. The city fixed effects are a set of city-specific dummy variables. By introducing them, we can control for time-invariant confounders specific to each city, such as geographical conditions and short-term economic level. The date fixed effects are a set of dummy variables accounting for shocks that are common to all cities on a given day, such as the Chinese Spring Festival and nationwide policies. Because both location and time fixed effects are included in the regression, the coefficient β estimates the difference in mental health status between the treatment cities and the control cities before and after the occurrence of the COVID-19 epidemic. We expected β to be negative, as both the coronavirus itself and counter-COVID-19 measures such as lockdown could harm mental health [27,28].

The underlying assumption of the DiD estimator is that treatment and control cities would have had parallel trends in mental health status in the absence of the COVID-19 event.
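To make the DiD specification concrete, the following is a minimal sketch under stated assumptions. It assumes equation (1) has the standard two-way fixed-effects form $Y_{it} = \beta\,COVID\_19_{it} + X_{it}\gamma + \mu_i + \lambda_t + \varepsilon_{it}$, where $\gamma$, $\mu_i$, $\lambda_t$ and $\varepsilon_{it}$ are notation introduced here for illustration. The sketch uses a balanced synthetic panel, omits the control variables, and does not compute the city-clustered standard errors; all numbers are hypothetical.

```python
import numpy as np


def within_transform(a):
    """Remove city (row) means and date (column) means from a balanced city-by-date array."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


def twfe_did(y, d):
    """Two-way fixed-effects estimate of beta for a balanced panel without controls."""
    y_w, d_w = within_transform(y), within_transform(d)
    return float((d_w * y_w).sum() / (d_w ** 2).sum())


# Hypothetical panel: 6 cities observed on 10 dates.
n_cities, n_dates = 6, 10
first_case = [3, 5, 7, 7, None, None]  # date index of first reported case; None = never treated
d = np.zeros((n_cities, n_dates))
for i, start in enumerate(first_case):
    if start is not None:
        d[i, start:] = 1.0  # treatment dummy switches on from the first case onward

rng = np.random.default_rng(0)
city_fe = rng.normal(0.0, 0.05, size=(n_cities, 1))  # time-invariant city effects
date_fe = rng.normal(0.0, 0.05, size=(1, n_dates))   # shocks common to all cities
beta_true = -0.05
y = 0.6 + city_fe + date_fe + beta_true * d          # noise-free, so beta is recovered exactly

beta_hat = twfe_did(y, d)
print(beta_hat)
```

On a balanced panel, subtracting row and column means is equivalent to including the full sets of city and date dummies. With unbalanced data or control variables, a regression package with explicit dummies and cluster-robust standard errors would be the more appropriate tool.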
Even if the results show that mental health status declines in treated cities after the occurrence of COVID-19, the results may be driven not by the epidemic but by systematic differences between treatment and control cities. For example, if treatment cities have a decreasing trend in mental health status and control cities do not, this could also drive the results. Although we cannot observe what would happen to mental health in the treated cities if [...] groups before the COVID-19 epidemic and investigate whether the two groups are comparable. To achieve this goal, we adopted an event study approach using the following relative time model [29,30,48]:

[...]

where $COVID\_19_{it,k}$ are a set of dummy variables that indicate the treatment status in different periods (weeks). Here, 7 days (one week) are put into one bin, indexed by $k$, so that the high volatility of the daily mental health level does not affect the trend test [21]. We omit the dummy for $k = -1$ (one week before the event), so the coefficient $\beta_k$ measures the difference in mental health status between the treatment and control cities in period $k$ relative to the difference one week before the treatment. This specification can not only test the parallel trend assumption but also examine whether the impact of the COVID-19 epidemic fades out over time. If the pre-treatment trends are parallel, the coefficient $\beta_k$ would not be significantly different from zero when $k \le -2$. The psychological effect of COVID-19 would fade out over time during our study period if we observe that $\beta_k$ is negative at first and then becomes not significantly different from zero in subsequent periods when $k \ge 0$. In all analyses, the [...]
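The weekly binning behind the relative time model can be sketched as below. The sketch assumes the model regresses mental health status on one dummy per relative week $k$, with $k = -1$ omitted, plus the same controls and fixed effects as equation (1). The bin convention (week 0 covers the day of the first case and the six days after it), the range of bins, and the example dates are assumptions made for illustration, not details confirmed by the text.

```python
from datetime import date


def relative_week(obs_date, first_case_date):
    """Week bin relative to the first case: 0 covers days 0..6 after, -1 covers days -7..-1."""
    return (obs_date - first_case_date).days // 7


def event_dummies(k, bins, omitted=-1):
    """One 0/1 dummy per relative week, leaving out the reference week."""
    return {b: int(k == b) for b in bins if b != omitted}


first_case = date(2020, 1, 20)  # hypothetical first-case date for one city
bins = range(-3, 6)             # hypothetical window of relative weeks

k = relative_week(date(2020, 1, 29), first_case)
row = event_dummies(k, bins)
never_treated_row = event_dummies(None, bins)  # control cities get all-zero dummies

print(k, row)
```

Floor division keeps negative offsets in the correct bin: the day before the first case maps to week -1 rather than week 0. Observations falling outside the chosen window would need an explicit rule (for example, pooling them into the end bins), which this sketch does not implement.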
*P < 0.05; **P < 0.01; ***P < 0.001. Due to some missing values of air pollution and weather data, the numbers of observations in the two columns are not the same. Standard errors are clustered at the city level and shown in parentheses.

Supplementary [...] The numbers of observations in columns (2) and (3) [...]

The randomization refers to the procedure of randomly assigning COVID-19's pseudo presence to treated cities and to all cities, respectively, repeated 1,000 times. $\beta_{pseudo}$ is the coefficient for COVID-19's pseudo presence, and β is the true coefficient for COVID-19's occurrence reported in Table 1; both were estimated using equation (1).
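The randomization procedure can be illustrated with a minimal sketch. It implements one possible reading of "randomly assigning COVID-19's pseudo presence to all cities": the treatment paths are shuffled across cities 1,000 times and the coefficient is re-estimated each time. The synthetic panel, the effect size, the noise level, and the no-controls estimator are all assumptions; the authors' exact assignment scheme is not recoverable from the text.

```python
import numpy as np


def within_transform(a):
    """Remove city (row) means and date (column) means from a balanced city-by-date array."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


def twfe_did(y, d):
    """Two-way fixed-effects estimate of the treatment coefficient (balanced panel, no controls)."""
    y_w, d_w = within_transform(y), within_transform(d)
    return float((d_w * y_w).sum() / (d_w ** 2).sum())


rng = np.random.default_rng(42)
n_cities, n_dates = 30, 40

# Hypothetical staggered treatment: 20 treated cities, 10 never-treated cities.
d = np.zeros((n_cities, n_dates))
for i in range(20):
    d[i, 10 + i:] = 1.0

beta_true = -0.05
y = (0.6
     + rng.normal(0.0, 0.05, size=(n_cities, 1))          # city effects
     + rng.normal(0.0, 0.05, size=(1, n_dates))           # date effects
     + beta_true * d                                      # true treatment effect
     + rng.normal(0.0, 0.01, size=(n_cities, n_dates)))   # idiosyncratic noise

beta_hat = twfe_did(y, d)

# Placebo runs: give each city another city's treatment path, chosen at random.
n_rep = 1000
beta_pseudo = np.empty(n_rep)
for r in range(n_rep):
    d_pseudo = d[rng.permutation(n_cities)]
    beta_pseudo[r] = twfe_did(y, d_pseudo)

# Share of placebo coefficients at least as large in magnitude as the true estimate.
share_as_extreme = float(np.mean(np.abs(beta_pseudo) >= abs(beta_hat)))
print(beta_hat, share_as_extreme)
```

If the true estimate lies far in the tail of the placebo distribution, the effect is unlikely to be an artefact of which cities happened to be treated when. Restricting the shuffle to treated cities only would correspond to the other variant mentioned in the note.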