Research design
In this study, I selected tweets posted on Twitter as the object of analysis. Prior to the analysis, it is necessary to understand the peculiarities of Twitter and its posts. Since its launch in March 2006, Twitter, a so-called microblogging site, has expanded its user base worldwide thanks to the ease and convenience of writing short messages (currently up to 280 characters). Today, it has grown into a leading social networking service (SNS) with 1.3 billion accounts, including those of heads of state and dignitaries, and 330 million monthly active users who post 500 million messages every day (Statista 2021). Expressions of emotions and sentiments, such as the opinions voiced on this platform, as well as behavioral patterns, have become valuable targets of analysis for data scientists and analysts. In addition, posts can be linked to specific topics through hashtags, and multidirectional communication through likes, retweets, and replies is also possible (Bruns & Liang, 2012). Analysis of social networking sites is therefore being applied in various fields such as market research, product reviews, and traffic prediction, and Twitter analysis in particular is widely used because of its usefulness.
Moreover, the expression of opinions and emotions on Twitter extends to all kinds of subjects, including people, things, concepts, and policies. In fact, since 2020 there has been a sharp increase in the number of studies of Twitter posts about the global COVID-19 pandemic, because such posts are believed to provide a great deal of useful information for exploring public sentiments and concerns about an infectious disease that rapidly spread around the world and changed people’s lives (Xue et al. 2020). The same approach can be applied to find out what people today think about the concept of “biodiversity”, which was coined in the 1980s. In other words, a diachronic analysis of Twitter posts may provide us with new insights that we have not obtained before.
However, it is important not to equate the discourse space on Twitter with the discourse space of society at large. Various factors complicate the analysis, such as bias in user demographics, individual differences in tweet frequency, the existence of bots, and the presence of much non-textual noise. Therefore, when analyzing posts on Twitter and drawing conclusions from them, we should keep in mind that Twitter is a distinctive platform subject to these limitations and restrictions.
The methods used in this study include n-gram counting and comparison, sentiment analysis using the NRC emotion lexicon (Mohammad et al. 2013), topic modeling using latent Dirichlet allocation (LDA) (Blei et al. 2003), and qualitative analysis of tweet texts. The outline of the research procedure is as follows:
1. I collected all tweets containing the keyword “biodiversity” from March 2006, when the Twitter service was launched, to December 2020, and extracted tweets that were purely in English.
2. I pre-processed all tweets from 2010 to 2020, counted n-grams (bigrams and trigrams), and listed the top 20.
3. In the same way, I counted the eight types of emotion words in the pre-processed tweets from 2010 to 2020 using the NRC emotion lexicon, calculated each type’s percentage of the total number of words, and visualized the results in charts.
4. For each year, I explored LDA topic models and constructed the model that seemed optimal. In this paper, I discuss the models for 2010 and 2020 through visualizations.
5. Based on the visualized information, I selected distinctive topics and created additional sentiment charts to examine the contents of their texts.
6. For some distinctive topics, I provided an overview of their tweet texts.
Data collection
In this study, I collected tweets containing the word “biodiversity” from March 21, 2006, when the Twitter service started, to December 31, 2020. The purpose was to explore the usage of “biodiversity” in ordinary contexts as well as in the hashtag “#biodiversity”. For the collection, I first applied for the Academic Research product track released by Twitter in January 2021 and obtained access to the full Twitter archive. I then used the open application programming interface (API) provided by Twitter together with the Python programming language (ver. 3.8.8). The total number of tweets collected by December 31, 2020, was 2,609,834, far more than for the equivalent terms in other major languages (biodiversité, biodiversidad, biodiversität). Of these, 2,405,937 tweets were purely in English, accounting for 92% of the total (Fig 1). I therefore decided to focus on English tweets in this study. The collected tweet information includes “text”, “author_id”, “created_at”, “lang”, “entities”, “geo”, and “public_metrics”. In this study, I focused on “text”, because the purpose of this study is to understand the general speech on Twitter; the attributes and location information of posters are out of scope, and many tweets lacked this information in any case. Fig 1 shows the total number of tweets and the number of English tweets for the period covered. According to this figure, the number of relevant tweets roughly doubled between 2010 and 2017, a relatively slow increase, but since 2018 the growth has clearly accelerated.
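As an illustration of this step, below is a minimal sketch of the kind of paginated request made against Twitter’s v2 full-archive search endpoint available to the Academic Research track. The bearer token is a placeholder, the field list mirrors the attributes named above, and error handling and rate-limit backoff are omitted.

```python
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"   # v2 full-archive search
HEADERS = {"Authorization": "Bearer YOUR_BEARER_TOKEN"}      # placeholder credential

params = {
    "query": "biodiversity",                                 # language filtered later
    "start_time": "2006-03-21T00:00:00Z",
    "end_time": "2020-12-31T23:59:59Z",
    "max_results": 500,                                      # maximum page size
    "tweet.fields": "author_id,created_at,lang,entities,geo,public_metrics",
}

tweets = []
while True:
    resp = requests.get(SEARCH_URL, headers=HEADERS, params=params)
    resp.raise_for_status()
    payload = resp.json()
    tweets.extend(payload.get("data", []))
    next_token = payload.get("meta", {}).get("next_token")
    if next_token is None:
        break                                                # archive exhausted
    params["next_token"] = next_token                        # fetch the next page
```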
Pre-processing the raw dataset
Prior to analysis, it was necessary to pre-process the raw Twitter data into a form suitable for analysis. In this study, I followed a number of related studies on Twitter analysis and performed two types of pre-processing in Python, one for sentiment analysis and the other for topic model building. The specific pre-processing steps are as follows (a code sketch follows the list):
1. Extracted only English tweets from the collected tweets.
2. Removed tweets with duplicate text.
3. Removed @usernames and links (pasted URLs such as http and www) in the text.
4. Removed special characters and punctuation from the text.
5. Excluded other strings without particular meaning by designating them as “stop words”.
6. The texts in this state were saved for sentiment analysis.
7. Also removed hashtagged words.
8. Tokenized the texts and deleted tweets with fewer than three tokens.
9. Counted and saved n-grams (bigrams and trigrams).
10. Performed lemmatization of the tokens.
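For illustration, the following is a minimal Python sketch condensing steps 3 to 10, assuming `raw_texts` holds the de-duplicated English tweet texts from steps 1 and 2; the exact regular expressions, stop-word list, and step ordering used in the study may differ.

```python
import re
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # requires nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_tweet(text, drop_hashtags=False):
    """Steps 3-4 (and 7 when requested): strip mentions, links, punctuation."""
    text = re.sub(r"@\w+", "", text)                     # @usernames
    text = re.sub(r"(https?://|www\.)\S+", "", text)     # pasted URLs
    if drop_hashtags:
        text = re.sub(r"#\w+", "", text)                 # hashtagged words
    return re.sub(r"[^A-Za-z\s]", " ", text)             # special characters, punctuation

# Step 6: texts in this state are saved for sentiment analysis.
sentiment_texts = [clean_tweet(t) for t in raw_texts]

# Steps 5, 7, 8, and 10: drop hashtags and stop words, tokenize, lemmatize,
# and discard tweets left with fewer than 3 tokens.
docs, kept_texts = [], []
for t in raw_texts:
    tokens = [w for w in simple_preprocess(clean_tweet(t, drop_hashtags=True))
              if w not in stop_words]
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    if len(tokens) >= 3:
        docs.append(tokens)
        kept_texts.append(t)        # original texts kept aligned with docs
```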
Data analysis
In this study, I used both quantitative and qualitative methods. First, as a quantitative study, I surveyed the data through visualization using LDA topic modeling and sentiment analysis; second, I qualitatively examined the content of specific categorized tweet texts. The following sections describe each analysis method.
Counting and comparing n-grams
In this study, I counted n-grams for each year to get an overview of the set of tweet texts narrowed down by the pre-processing described above. An n-gram is a sequence of n words: a sequence of two words is called a bigram and a sequence of three words a trigram. Here, I used Gensim, which is available in Python. Comparing the top-ranked n-grams made it possible to identify the specific keywords appearing in each year’s texts, to see which words were used with “biodiversity” throughout the entire period, and to identify the words characteristic of each year. In this paper, the top 20 bigrams and trigrams from 2010 to 2020 are listed for comparison and discussion.
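As an illustration of the counting step, the sketch below tallies bigrams and trigrams with a plain counter over the tokenized tweets of one year; the study itself used Gensim, so the exact mechanics here are an assumption, but the output, a top-20 list per year, is the same.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-word sequences from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

bigram_counts, trigram_counts = Counter(), Counter()
for doc in docs:                                  # docs: tokenized tweets for one year
    bigram_counts.update(" ".join(g) for g in ngrams(doc, 2))
    trigram_counts.update(" ".join(g) for g in ngrams(doc, 3))

top20_bigrams = bigram_counts.most_common(20)     # the per-year lists reported in the paper
top20_trigrams = trigram_counts.most_common(20)
```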
Sentiment analysis
Sentiment analysis is defined as the automated process of mining attitudes, opinions, views, and emotions from text, speech, tweets, and database sources using natural language processing (NLP), and is described as the process of analyzing people’s feelings, attitudes, opinions, and emotions towards elements such as products, individuals, topics, organizations, and services (Kharde & Sonawane 2016). The number of analyses of social networking sites has increased rapidly, the most popular approach being categorization into positive, negative, and neutral. In recent years, however, various methods such as machine-learning approaches and lexicon-based approaches have been devised and are developing rapidly.
In this study, I used a lexicon-based approach, specifically a dictionary-based method, with the NRC emotion lexicon (Mohammad et al. 2013). The lexicon was manually curated through a crowdsourced task covering tens of thousands of English words, each encoded with a sentiment (positive or negative) and, via binary variables, with the eight discrete emotions of anger, anticipation, disgust, fear, joy, sadness, surprise, and trust (Mohammad 2020). In my preliminary research, I found that categorization into positive, negative, and neutral was extremely abstract and subject to wide swings, ultimately forcing me to carefully read and examine specific texts. I therefore decided to read the eight types of emotions from the tweet texts, as I needed to clarify the direction of more specific emotions.
According to the developer of the NRC emotion lexicon, the lexicon is suited to comparing multiple sets of data by producing percentages of the total number of words (Mohammad 2020). In this study, I examined the emotions expressed in the tweets as a whole and in individual tweets by counting the words belonging to each of the eight emotion types in the NRC emotion lexicon and computing their percentage of the total number of words in the texts of tweets containing the word “biodiversity”. I excluded the years from 2006 to 2009, when the total number of tweets per year was small, and used Python to calculate, for each year from 2010 to 2020, the percentage of words constituting each emotion in the pre-processed tweet texts, and visualized the results.
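The counting itself can be sketched as follows. The snippet assumes `nrc` is a dictionary mapping each word to its set of NRC emotion labels, loaded from the lexicon file distributed by its authors, and that `tokens_by_year` pools each year’s tokens; the loading code and the exact treatment of the denominator are assumptions.

```python
from collections import Counter

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

def emotion_percentages(tokens, nrc):
    """Percentage of tokens tagged with each NRC emotion, over all tokens."""
    counts = Counter()
    for tok in tokens:
        for emo in nrc.get(tok, ()):
            if emo in EMOTIONS:
                counts[emo] += 1
    total = len(tokens) or 1                     # guard against empty texts
    return {emo: 100 * counts[emo] / total for emo in EMOTIONS}

# Per-year figures for 2010-2020: pool each year's tokens, then compute the shares.
yearly = {year: emotion_percentages(tokens, nrc)
          for year, tokens in tokens_by_year.items()}
```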
Latent Dirichlet allocation (LDA)
For the purpose of this study, exploring the discourse on “biodiversity” on Twitter, it was essential to identify the topics users discussed each year. For this purpose I used latent Dirichlet allocation (LDA) (Blei et al. 2003). LDA is a form of unsupervised machine learning in which the model assumes that each document consists of a mixture of latent topics and that each topic is characterized by a distribution over linguistic units. The algorithm derives pairs of frequently mentioned and co-occurring words, the latent topics in a document, and the document’s distribution over those topics from the data itself (Xue 2019). To date, it has been applied to many kinds of sociological research, including the analysis of news articles, and is considered an efficient method for identifying patterns, themes, and structures in large, unstructured collections of text, such as tweets, and for classifying them by topic based on these patterns.
In this study, I used the Python library Gensim and the Java-based open-source toolkit MALLET to run multiple trials on the tweet texts of each year from 2010 to 2020 with different numbers of topics, and the best topic models were explored, constructed, and visualized. However, for reasons of space, I used the NRC emotion lexicon to count and calculate the percentage of emotion words in the modeled topics for 2010 and 2020 only, and present the visualization results for comparison and analysis.
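The model-selection loop can be sketched as follows. The study ran MALLET through Gensim; this sketch instead uses Gensim’s native LdaModel with c_v coherence as the selection criterion, and the grid of topic numbers, pass count, and random seed are assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

dictionary = Dictionary(docs)                    # docs: lemmatized tweets for one year
corpus = [dictionary.doc2bow(doc) for doc in docs]

best_model, best_score = None, float("-inf")
for k in range(5, 21, 5):                        # assumed grid of topic numbers
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, random_state=42, passes=10)
    score = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    if score > best_score:                       # keep the most coherent model
        best_model, best_score = lda, score
```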
Qualitative analysis
After categorizing, visualizing, and interpreting the data through quantitative research, it is beneficial to conduct a closer analysis through qualitative research. In this study, when building each topic model I also extracted the set of keywords constituting each topic and identified representative tweets for it. Every tweet text was assigned a score (Topic_Perc_Contrib) for its weight within each topic, and the original texts of the highest-scoring tweets were reported as representative texts.
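One way to derive these scores from the fitted model is sketched below, reusing the `corpus`, `best_model`, and `kept_texts` from the earlier sketches; the column names mirror the Topic_Perc_Contrib field mentioned above, while the exact selection procedure used in the study is an assumption.

```python
import pandas as pd

rows = []
for i, bow in enumerate(corpus):
    # Dominant topic of tweet i and its weight in the topic mixture.
    topic, weight = max(best_model.get_document_topics(bow), key=lambda tw: tw[1])
    rows.append({"text": kept_texts[i], "Dominant_Topic": topic,
                 "Topic_Perc_Contrib": weight})

df = pd.DataFrame(rows)
representative = (df.sort_values("Topic_Perc_Contrib", ascending=False)
                    .groupby("Dominant_Topic").head(3))   # top tweets per topic
```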
In this way, it was possible to infer the dominant discourse by identifying and examining particularly important texts from among the large number of tweets, while at the same time minimizing the drawbacks of a purely quantitative analysis. Although it would have been possible to examine and define all the categorized topics in detail, I focused on a few distinctive topics and examined the tweets constituting them. With this method, it is possible to grasp, in broad terms, the dominant discourse of each year in the Twitter space.