Data collection
Data were collected using the Twitter Application Programming Interface (API), a web-based interface that allows users to interact with Twitter’s data (i.e., tweets and metadata). The API allowed for keyword searches of all tweets containing the words “exercise” and/or “fat”. These searches were run daily over a period of three months (November 2017 to February 2018). Each daily search returned the most recent tweets of the day (up to 50,000). It also returned the number of times each tweet was retweeted, the user-id of the user who produced the tweet, and the number of followers of that user.
Community structure and overlap
To determine whether networks of users consistently talking about exercise and fat exist on Twitter (objective 1), users who tweeted at least once a week about “fat” or “exercise” were assigned to a core group, while those who tweeted only once in three months about “fat” or “exercise” were assigned to the visiting group. Users who tweeted more than once but less often than once a week were excluded from the study.
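The grouping rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not say how weeks were delimited, so the sketch counts consecutive 7-day bins from the start of the observation window, and the function name `classify_user` is hypothetical.

```python
from datetime import date

def classify_user(tweet_dates, start, end):
    """Return 'core', 'visiting', or 'excluded' for one user.

    Assumption: "at least once a week" means at least one tweet in every
    full 7-day bin counted from `start`. The source does not state how
    weeks were delimited, so this binning is illustrative only.
    """
    in_window = [d for d in tweet_dates if start <= d <= end]
    if not in_window:
        return "excluded"          # no matching tweets in the window
    if len(in_window) == 1:
        return "visiting"          # tweeted exactly once in the window
    n_weeks = ((end - start).days + 1) // 7   # full weeks only
    weeks_with_tweet = {(d - start).days // 7 for d in in_window}
    if all(week in weeks_with_tweet for week in range(n_weeks)):
        return "core"              # at least one tweet in every week
    return "excluded"              # more than once, but not weekly

# Toy four-week window, used only for illustration.
start, end = date(2017, 11, 1), date(2017, 11, 28)
weekly = [date(2017, 11, 1), date(2017, 11, 8),
          date(2017, 11, 15), date(2017, 11, 22)]
group = classify_user(weekly, start, end)
```

A different reading of "once a week" (for example, calendar weeks or a rolling average) would change which users fall into the core group, so the binning choice should be treated as a parameter of the analysis.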
To study a network of Twitter users, each user is considered a node. A node can follow another node, creating a link (e.g., A → B). In A → B, node A has an out-degree of 1 (because it follows B) and node B has an in-degree of 1 (because it is followed by A). In social networks, the direction of the arrow typically represents information flow, but in the case of Twitter the relationship is reversed, because users follow other users to see the content they publicly post, not to send them information directly. In short, if node A follows node B (A → B), node A sees the information posted by node B.
Nodes with a large following within the network have a high in-degree, while nodes following a large number of users within the network have a high out-degree; the sum of both is hereafter referred to as total-degree. Additionally, communities can emerge within the social network [8]. These communities are composed of users who have more connections among themselves than to nodes outside the community, which may themselves belong to different communities.
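The degree definitions above can be illustrated with a short Python sketch. The function name `degree_table` and the toy edge list are hypothetical; each edge is written as (follower, followed), matching the A → B convention.

```python
from collections import Counter

def degree_table(follows):
    """Compute in-, out-, and total-degree from (follower, followed) pairs."""
    out_deg, in_deg = Counter(), Counter()
    for follower, followed in follows:
        out_deg[follower] += 1   # the follower follows one more account
        in_deg[followed] += 1    # the followed account gains one follower
    nodes = set(out_deg) | set(in_deg)
    return {
        n: {"in": in_deg[n], "out": out_deg[n],
            "total": in_deg[n] + out_deg[n]}
        for n in nodes
    }

# Toy network: A follows B, C follows B, B follows A.
edges = [("A", "B"), ("C", "B"), ("B", "A")]
table = degree_table(edges)
```

In this toy network B has the highest in-degree, meaning that content posted by B is seen by the most users within the network.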
Relationships between users were mapped to allow for social network analysis and to determine network structure. The user-ids of the core group were used to identify the followers of each individual user. The relationships were then filtered to include only relationships within the core group. Using these relationships, a core weight-talk network and a core fitness-talk network were mapped. Additionally, to determine whether the weight-talk and fitness-talk communities overlap, a combined network was mapped, and its metrics were compared with those of the two separate networks.
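The filtering step can be sketched as follows. The input structure `followers_of` (a mapping from a user-id to the ids of its followers) and the function name `core_edges` are assumptions made for illustration; the source does not describe the raw data format. The combined network is built here by applying the same filter to the union of the two core groups, which is one plausible reading of the description above.

```python
def core_edges(followers_of, core_ids):
    """Return directed (follower, followed) edges within the core group.

    followers_of: dict mapping a user-id to an iterable of follower ids
    (hypothetical structure, assumed for this sketch).
    """
    core = set(core_ids)
    return {
        (follower, user)
        for user, followers in followers_of.items()
        if user in core
        for follower in followers
        if follower in core and follower != user
    }

# Toy data: user 99 is outside both core groups and is filtered out.
followers_of = {1: [2, 3, 99], 2: [1], 3: [], 4: [1, 3]}
weight_core = {1, 2}
fitness_core = {3, 4}

weight_net = core_edges(followers_of, weight_core)
fitness_net = core_edges(followers_of, fitness_core)
# Combined network: same filter over the union of both core groups,
# which also keeps follow relationships that cross the two groups.
combined_net = core_edges(followers_of, weight_core | fitness_core)
```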
To answer questions regarding structure and overlap (objective 2), four standard network metrics were used: density, average path length, mean total-degree, and clustering coefficient. Density measures how many of the possible connections exist in the network and ranges from zero (no connections) to one (all possible connections exist). Average path length is the mean length of the shortest paths between pairs of nodes and is important because it indicates how far, on average, information has to travel to get from one node to another. Mean total-degree is the average number of connections per node and is useful for determining how well connected a network is. Finally, the clustering coefficient measures the extent to which nodes in a network tend to cluster together, which allows us to understand the network structure in reference to its communities.
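The four metrics can be computed with a short, self-contained Python sketch. Several choices here are assumptions, since the source does not specify them: density and path length treat follow links as directed, average path length is taken over ordered pairs that are actually reachable, and the clustering coefficient is the average local coefficient with link direction ignored. The function name `network_metrics` is hypothetical.

```python
from collections import deque
from itertools import combinations

def network_metrics(nodes, edges):
    """Density, average path length, mean total-degree, and clustering."""
    edges = {(u, v) for u, v in edges if u != v}     # drop self-loops
    nodes = set(nodes) | {x for e in edges for x in e}
    n = len(nodes)
    succ = {u: set() for u in nodes}   # directed successors
    nbrs = {u: set() for u in nodes}   # neighbours, direction ignored
    for u, v in edges:
        succ[u].add(v)
        nbrs[u].add(v)
        nbrs[v].add(u)

    density = len(edges) / (n * (n - 1)) if n > 1 else 0.0
    mean_total_degree = 2 * len(edges) / n if n else 0.0

    # Breadth-first search from every node gives shortest path lengths.
    total, pairs = 0, 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1         # reachable targets, excluding source
    avg_path = total / pairs if pairs else float("nan")

    # Local clustering: share of a node's neighbour pairs that are linked.
    coeffs = []
    for u in nodes:
        k = len(nbrs[u])
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs[u], 2) if b in nbrs[a])
        coeffs.append(2 * links / (k * (k - 1)))
    clustering = sum(coeffs) / n if n else 0.0

    return {"density": density, "average_path_length": avg_path,
            "mean_total_degree": mean_total_degree, "clustering": clustering}

# Toy network: a directed triangle A -> B -> C -> A plus C -> D.
metrics = network_metrics("ABCD", [("A", "B"), ("B", "C"),
                                   ("C", "A"), ("C", "D")])
```

Restricting the path-length average to reachable pairs avoids infinite distances in networks that are not strongly connected; other conventions exist and would yield different values.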
Additionally, modularity was calculated for each network to identify communities based on the similarity of their connections. This was done using the “fast unfolding of communities in large networks” algorithm [8], which decomposes a network into sub-units or communities, that is, sets of highly interconnected nodes. The four main communities of each network were extracted. Hubs were identified by ranking the nodes by in-degree. This allowed for a linguistic corpus analysis based on the communities of each network.
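As an illustration of the quantity that the community detection algorithm seeks to maximise, the sketch below computes the modularity score Q of a given partition, together with a simple in-degree ranking for hubs. It is not an implementation of the algorithm in [8]: it only scores a partition that is supplied by hand. The sketch also assumes an undirected, unweighted view of the network, and the function names `modularity` and `top_hubs` are hypothetical.

```python
from collections import Counter, defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition (undirected, unweighted view).

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    m = len(undirected)
    if m == 0:
        return 0.0
    intra = defaultdict(int)
    degree_sum = defaultdict(int)
    for edge in undirected:
        u, v = tuple(edge)
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

def top_hubs(follow_edges, k):
    """Rank nodes by in-degree; edges are (follower, followed) pairs."""
    in_deg = Counter(followed for _, followed in follow_edges)
    return [node for node, _ in in_deg.most_common(k)]

# Toy network: two triangles joined by a single bridge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
partition = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
q = modularity(toy_edges, partition)
```

Placing each triangle in its own community gives a higher Q than putting all six nodes in one community, which is the kind of comparison a modularity-based method makes when merging or splitting groups.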
Linguistic corpus
The linguistic corpus is the whole set of text data (tweets) to be analyzed. Simple Natural Language Processing (NLP) techniques, such as the division of sentences into individual words (tokenization) and the subsequent analysis of those words in pairs or triplets (n-grams), were used to analyze these data. Latent Dirichlet Allocation (LDA), which allows groups of observations (words) to be explained by unobserved groups (themes), was also used. A total of 3,772,507 tweets were collected. Of those, non-English-language tweets (n=510,145) and tweets from users who were not in the core or visiting categories (n=1,291,155) were removed. As a result, the total corpus was reduced to 1,971,207 tweets.
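Tokenization and n-gram extraction can be sketched with the Python standard library. The regular expression used here is a deliberately simple assumption (lower-cased runs of letters, digits, and apostrophes); the source does not describe the tokenizer that was actually used, and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a tweet into lower-cased word tokens (simplified rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(tweets, n):
    """Count n-grams over a collection of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    return counts

# Toy tweets, invented for illustration.
tweets = ["Exercise burns fat!", "Does exercise burn fat?"]
bigram_counts = ngram_counts(tweets, 2)
trigrams = ngrams(tokenize(tweets[0]), 3)
```

N-grams are computed within each tweet, so no n-gram spans the boundary between two tweets.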
Linguistic n-grams
To confirm that the communicational content of the weight-talk and fitness-talk networks is explicitly negative or weight-loss related (objective 3), the tweets were divided according to the four main communities of each core social network. A list of linguistic bigrams and trigrams that excluded prepositions, conjunctions, and linguistic fillers was then generated. Finally, an LDA model was applied to differentiate individual unobserved clusters of words, which allowed for a qualitative lexical text analysis and the extraction of common themes within the communities.
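The two steps above can be illustrated with a minimal sketch: n-gram extraction after removing function words, followed by a small LDA model fitted by collapsed Gibbs sampling. Everything specific here is an assumption: the stop list is a tiny hypothetical example rather than the list used in the study, the hyperparameters `alpha` and `beta` and the iteration count are arbitrary, the model is fitted on filtered single tokens because the source does not state the input unit, and the source does not say which LDA implementation or inference method was used.

```python
import random
from collections import Counter

# Hypothetical, deliberately tiny stop list (not the list used in the study).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in",
              "on", "for", "with", "um", "uh"}

def filtered_ngrams(tokens, n, stop_words=STOP_WORDS):
    """Drop stop words, then return runs of n consecutive tokens."""
    kept = [t for t in tokens if t not in stop_words]
    return [tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)]

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]       # topic counts per doc
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    for d, doc in enumerate(docs):                   # random initialisation
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]                     # remove current token
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z                     # add it back, resampled
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    top_words = [[w for w, c in tw.most_common() if c > 0][:5]
                 for tw in topic_word]
    return theta, top_words

# Toy documents, invented for illustration.
docs = ["lose the fat with exercise".split(),
        "exercise and diet to lose fat".split(),
        "gym workout and cardio exercise".split()]
filtered = [[t for t in doc if t not in STOP_WORDS] for doc in docs]
theta, top_words = lda_gibbs(filtered, n_topics=2, n_iter=50)
```

Here `theta` holds the estimated topic proportions for each document and `top_words` lists the most frequent words assigned to each topic; the latter is the kind of output that would be read qualitatively to name themes.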