Dictionary-based Classification of Tweets About Environment

In the era of social media, the wide availability of digital data (e.g. posts sent through social networks or unstructured data scraped from websites) makes it possible to develop new types of research in a wide range of fields. These types of data offer advantages such as reduced collection costs, short retrieval times and the production of almost real-time outputs. Nevertheless, their collection and analysis can be challenging. For example, particular approaches are required to select posts related to specific topics; moreover, retrieving the information of interest from Twitter posts can be a difficult task.
The main aim of this paper is to propose an unsupervised dictionary-based method to filter tweets related to a specific topic, i.e. environment. We start from the tweets sent by a selection of Official Social Accounts (OSA) clearly linked with the subject of interest. Then, a list of keywords is identified in order to set up a topic-oriented dictionary. We test the performance of our method by applying the dictionary to more than 54 million geolocated tweets posted in Great Britain between January and May 2019.

performs better with an F1 score equal to 0.83. They also run feature-based algorithms (logistic regression, SVM and Naïve Bayes) with linguistic, temporal and geospatial features to predict people's behavior during hurricane events. In this case SVM provides the best performance, with F1 values ranging between 0.47 and 0.79 depending on the features considered.

5 Source: https://amiibereval2018.wordpress.com/ (last accessed on September 3rd, 2019).
6 The F1 score is a performance index depending on precision and recall (see Sect. 5 for its definition).
The method we propose in this paper is unsupervised and dictionary-based. Unlike Cody et al. (2015) and Pruss et al. (2019), it builds a dictionary that includes the most common bigrams (footnote 7), trigrams (footnote 8) and hashtags about environment without needing a starting set of keywords. Only a list of OSA has to be provided in advance, and a limited number of human checks are needed to avoid the inclusion of overly general keywords or acronyms that would lead to the selection of tweets not strictly pertaining to environment (or to the chosen topic). Thus, our approach minimizes the amount of human work required, because it needs neither a set of labeled tweets for training (as required by supervised approaches) nor a predefined set of keywords (which could be too general or not completely focused on the studied topic). At the same time, thanks to the arbitrary selection of the OSA and to the possibility of reviewing the dictionary creation step by step, it is very flexible and could be applied to, and personalized for, any topic of interest.

In the following subsections we describe the two datasets used for the analysis. The first one (Sect. 3.1) is composed of a sample of tweets posted by OSA related to the analyzed topic, environment. Starting from these data, the algorithm sets up the dictionary. The dictionary is then applied to the second dataset (Sect. 3.2), composed of the tweets posted in GB between 2019/01/14 and 2019/05/13. The algorithm has been implemented in R (footnote 9).

The general idea behind the tweet selection is that posts about the same topic should be similar to each other and different from tweets related to other themes. As a consequence, tweets pertaining to a certain topic should generally include similar words or combinations of words. Our work aims at detecting and studying posts about environment. For this purpose, our preliminary objective is to set up a dictionary including the most common and relevant keywords related to this topic. As a first step, we identified 12 OSA linked to environment. In particular, we chose verified accounts (footnote 10), or profiles that have at least 10,000 followers, belonging to non-profit associations, research institutes and intergovernmental organizations whose activity is related to environment (footnote 11). The OSA selection is an arbitrary phase of the algorithm. The accounts were chosen because of their popularity and with the aim of covering all the possible aspects of environment (e.g. climate change, plastic pollution, nature protection). Note that, as described in Sect. 4.2, the choice of OSA can affect the final dictionary.
For each account, we retrieved all the most recent posted or retweeted tweets that Twitter allows us to download, up to 2019/05/10. Among the 38,611 tweets obtained, we kept only posts written in English. Then, we cleaned the corpus by removing URL links, HTML codes, non-ASCII and special characters, while keeping hashtags. This list of cleaned tweets is our first dataset. In Sect. 4.1 we analyze this dataset in order to detect the most recurrent expressions (i.e. bigrams, trigrams) and hashtags used by the considered OSA.

ii. an optional extra cleaning stage before the choice of the thresholds (step b in Figure 1). The aim is to remove, from the list of selected OSA expressions, some common terms which are widely used on Twitter and very likely not related to environment.
Even though this step is optional, we strongly recommend it, because it reduces the standard review process performed at step d in Figure 1. The extra cleaning considers the full set of GB tweets described in Sect. 3.2 to identify the list of general expressions, i.e. popular bigrams and trigrams (step a in Figure 2). These recurrent expressions are used to remove general expressions such as "trump administration", "taking action" and "million people" from the list of OSA bigrams and trigrams. It is important to note that this procedure can be performed by using either the full set of GB tweets or a smaller sample, in order to reduce the computational time. In our case study, the final dictionary did not change considerably when different samples or the complete dataset of GB tweets were used. For this reason, we decided to use a random sample of 3.5 million tweets collected between March 10th and May 13th, 2019. We arbitrarily chose a very low threshold: the list of general expressions includes only bigrams and trigrams tweeted at least 20 times. This way, we obtained 30,656 general expressions (step a in Figure 2). However, this vector of recurrent bigrams and trigrams may contain expressions linked to environment, such as "climate change", which we obviously do not want to be part of the list (otherwise they would not be included in the dictionary). Thus, a review of the list of general expressions is necessary. This can be done by adopting one of the following two approaches:
   a. user-based approach (step b in Figure 2): the user examines all the general expressions one by one and removes the ones related to environment;
   b. list-based approach (step c in Figure 2): in this case we assume that a set of expressions related to environment is available (prepared ad hoc by the researcher or taken from existing dictionaries).
The two lists are then matched and the environment-related expressions are removed from the set of general expressions.

prepared specifically for our case study. After removing from the list of general expressions the terms related to environment, the resulting vector is composed of 30,632 expressions. It is then possible to proceed with the extra cleaning of the OSA bigrams and trigrams by removing all the terms included in the set of 30,632 common expressions (step b in Figure 1). Moreover, in this same step, all the bigrams and trigrams that contain country names or US state names are removed. The extra cleaning step removed 7 bigrams and no trigrams (see the red expressions in Table 1). Finally, after the extra cleaning, the standard cleaning (step d in Figure 1) is used to review the new list of OSA expressions in order to exclude other terms not related to the studied topic, such as "start donating" or "coral reefs" (see the blue expressions in Table 1). In our application, this standard review step removed 10 bigrams and 1 trigram. As a result, we obtain the final list of bigrams and trigrams related to the topic.

16 air clean, air pollution, air quality, carbon emissions, carbon pollution, clean air, clean energy, climate action, climate change, climate conference, climate crisis, climate reality, climate science, climate solutions, coal ash, coal plants, coalfired power, conference cop, environmental laws, extreme weather, food waste, fossil fuel, fossil fuels, fuel industry, gas drilling, gas emissions, gas industry, global climate, global temperatures, global warming, greenhouse gas, healthy environment, offshore drilling, palm oil, paris agreement, plastic bags, plastic bottles, plastic packaging, plastic pollution, plastic straws, plastic waste, renewable energy, singleuse plastic, singleuse plastics, tar sands, toxic chemicals, toxic pesticides, warming world, weather events.
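The extra cleaning step described above (general expressions with a frequency threshold, list-based review, then removal from the OSA expressions) can be sketched as follows. This is only a minimal illustration in Python, not the authors' R implementation: the function names, the tokenization rule and the toy data are all assumptions made for the example.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams (joined by single spaces) of a lowercased text."""
    # Assumption: words are plain runs of letters; the paper's tokenizer may differ.
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def general_expressions(tweets, min_count=20):
    """Bigrams and trigrams occurring at least min_count times in a generic sample."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tweet, 2))
        counts.update(ngrams(tweet, 3))
    return {expr for expr, count in counts.items() if count >= min_count}


def extra_cleaning(osa_expressions, general, topic_expressions):
    """List-based review of the general list, then cleaning of the OSA expressions."""
    # List-based approach: topic-related expressions must not be treated as general.
    reviewed_general = general - set(topic_expressions)
    # Extra cleaning: drop the OSA expressions that are common in generic tweets.
    return [expr for expr in osa_expressions if expr not in reviewed_general]


# Toy data, purely hypothetical.
sample = ["taking action on climate change now"] * 20 + ["good morning everyone"] * 5
general = general_expressions(sample, min_count=20)
osa = ["climate change", "taking action", "plastic waste"]
cleaned = extra_cleaning(osa, general, topic_expressions=["climate change"])
print(cleaned)
```

In this toy run "taking action" is dropped because it is frequent in the generic sample, while "climate change" survives only because the list-based review removed it from the general list first.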
@climateprogress, @ClimateReality, @friends_earth, @Greenpeace, @GreenpeaceUK, @LessPlasticUK, @PlasticPollutes, @UNEnvironment, @UNFCCC, @World_Wildlife, @WWF, @WWFScotland, @NRDC, @nature_org, @EnvDefenseFund, @Earthjustice, @foe_us, @guardianeco, @HuffPostGreen, @insideclimate, @PlanetGreen, @ClimateCentral.
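As a rough illustration of how a tweet could be cleaned as described in Sect. 3.1 and then matched against a dictionary of expressions and hashtags, consider the sketch below. It is written in Python under stated assumptions and is not the authors' R code: the regular expressions, the function names and the example hashtag are hypothetical, and the exact cleaning rules used in the paper may differ.

```python
import html
import re


def clean_tweet(text):
    """Remove URLs, HTML codes, non-ASCII and special characters; keep hashtags."""
    text = html.unescape(text)                      # decode HTML codes such as &amp;
    text = re.sub(r"https?://\S+", " ", text)       # drop URL links
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = re.sub(r"[^A-Za-z0-9#\s]", "", text)     # drop special characters, keep '#'
    return re.sub(r"\s+", " ", text).strip().lower()


def matches_dictionary(text, expressions, hashtags):
    """Return True if the cleaned tweet contains a dictionary expression or hashtag."""
    cleaned = clean_tweet(text)
    # Hashtags are matched as whole tokens.
    if any(token in hashtags for token in cleaned.split()):
        return True
    # Bigrams and trigrams are matched on word boundaries via space padding.
    padded = " " + cleaned + " "
    return any(" " + expr + " " in padded for expr in expressions)


# Hypothetical dictionary: two bigrams from the final list and one made-up hashtag.
expressions = {"climate action", "singleuse plastics"}
hashtags = {"#plasticpollution"}

print(matches_dictionary("We need climate action now!", expressions, hashtags))
print(matches_dictionary("Nice weather today", expressions, hashtags))
```

Removing the hyphen without inserting a space turns "single-use" into "singleuse", which is consistent with entries such as "singleuse plastic" in the final list, although this particular rule is inferred from the list rather than stated in the text.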
5 Results: dictionary performance

In order to evaluate the performance of the dictionary-based filtering, we randomly choose 600 tweets selected and 600 tweets not selected by the algorithm (i.e. classified as not linked to "environment"). Then, we manually classify these posts into two categories: "related" and "non-related" to environment. This allows us to compute the following relevant quantities, which can be collected in the confusion matrix reported in Table 3:

The algorithm performance has been evaluated through the following indexes, based on the confusion matrix shown in Table 3:

For all the indicators the range is between 0 and 1 and the "the higher the better" rule holds. Results, here, are expressed in percentage terms.
AC can be used as a first overall measure to evaluate the performance of the classification algorithm, as it relates the tweets correctly classified to the total number of posts. In particular, following this criterion, our method is able to correctly classify 98.42% of the total number of tweets. This

Finally, instead of proposing as a starting point a list of single keywords or terms, we propose the use of bigrams and trigrams; this choice reduces the misclassification error related to the use of

algorithm can also be used to study how quickly and how much a dictionary regarding our topic could change over time.
In addition, our method allows tweets to be filtered by topic; thus it can be applied as a starting point to develop a wide variety of analyses regarding other topics, or can be used to go deeper into the study of our same topic. Further studies could focus, for example, on sub-topics of environment. For example, it would be interesting to filter tweets related to local problems (e.g. air pollution) rather than to global issues (e.g. global warming) for more detailed longitudinal and spatial studies of the sentiment.
This highly detailed information could be used to study the sentiment on a small scale and, at the same time, to explore how much people care about big themes such as the health of the Earth. In this way, we are able to capture the population's feelings, to link them to national and/or international policies and events, and to identify the main drivers of the inclination and sentiment trends.
Finally, the flexibility of our method can be exploited to create several dictionaries for all the sub-topics connected to a more general phenomenon, such as well-being (which includes, by nature, different domains, e.g. social involvement, health, work status, discrimination; see Toninelli and Cameletti 2018). In this case, selected tweets can be used to study the single domains and to estimate the subjective well-being and/or how much a single domain is able to affect the subjective well-being of a population. This will represent an improvement with respect to standard questionnaire-based surveys, such as the European Social Survey (footnote 18). Better still, the two types of sources can be integrated. In fact, thanks to the real-time collection of tweets, it will be possible to obtain timely information about a multidimensional phenomenon such as well-being with a very high temporal and spatial resolution. These results can be of high value for evaluating the interventions of policy makers, for measuring the effectiveness of advertising campaigns, and for studying many other socio-demographic phenomena.
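For readers who want to relate the evaluation in Sect. 5 to concrete numbers, the sketch below computes accuracy (AC), precision, recall and the F1 score from the four cells of a confusion matrix, using their standard textbook definitions. It is a Python illustration under stated assumptions: the function name and the counts are hypothetical, the counts are unweighted, and the example does not attempt to reproduce the figures reported in the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard indexes computed from the cells of a 2x2 confusion matrix.

    tp: selected by the algorithm and manually labeled as related
    fp: selected by the algorithm but manually labeled as non-related
    fn: not selected by the algorithm but manually labeled as related
    tn: not selected by the algorithm and manually labeled as non-related
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

All four indexes lie between 0 and 1 and are printed here in percentage terms, mirroring the convention stated in Sect. 5.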