Data collection
Data were collected using Twitter’s Application Programming Interface (API), a web-based interface that provides programmatic access to Twitter’s data (i.e., tweets and metadata). The API was used to run keyword searches for all tweets containing the words “exercise” and/or “fat”. These searches were run daily over a period of 3 months (November 2017 to February 2018). Each daily search returned the most recent tweets of the day (up to 50,000).
Community structure and overlap
To determine whether networks of users who consistently talk about exercise and fat exist on Twitter (objective 1), users who tweeted about “fat” or “exercise” at least once a week were assigned to a core group, while those who tweeted about “fat” or “exercise” only once in the three months were assigned to a visiting group. This threshold is similar to those used in previous studies of core and visiting communities [7]. To identify users who consistently talk about weight or exercise, only users who tweeted at least once a week for the duration of data collection were included in the study.
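As an illustration, the group assignment could be sketched as follows. This is a minimal sketch and not the code used in the study: the function name `assign_groups`, the `(user_id, date)` input format, and the reading of "once a week" as at least one tweet in every consecutive 7-day window are all assumptions.

```python
from collections import defaultdict
from datetime import date


def assign_groups(tweets, start, end):
    """Label each user 'core', 'visiting', or None.

    tweets: iterable of (user_id, date) pairs (hypothetical input format).
    Assumption: 'at least once a week' means at least one tweet in every
    consecutive 7-day window between start and end (inclusive).
    """
    n_weeks = (end - start).days // 7 + 1
    weeks_seen = defaultdict(set)  # user -> week indices containing a tweet
    counts = defaultdict(int)      # user -> total number of tweets
    for user, day in tweets:
        if start <= day <= end:
            weeks_seen[user].add((day - start).days // 7)
            counts[user] += 1

    groups = {}
    for user, total in counts.items():
        if len(weeks_seen[user]) == n_weeks:
            groups[user] = "core"
        elif total == 1:
            groups[user] = "visiting"
        else:
            groups[user] = None  # in neither category
    return groups
```

Users labelled `None` correspond to those who fall in neither category under these assumptions.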
On Twitter, users can follow other users, creating a directed link (e.g., A → B). In A → B, user A has an out-degree of 1 (because A follows B) and user B has an in-degree of 1 (because B is followed by A). In social networks, the direction of the arrow typically represents information flow, but on Twitter the relationship is reversed: users follow other users to see the content they publicly post, not to send them information. In short, if user A follows user B (A → B), user A sees the information posted by user B.
Users with a large following within the network have a high in-degree, while users following a large number of users within the network have a high out-degree; the sum of both is hereafter referred to as total-degree. Additionally, communities can emerge within the social network [8]. These communities are composed of users who have more connections among themselves than to users outside the community, who may themselves belong to other communities.
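The degree definitions above can be made concrete with a short sketch. The function name `degree_counts` and the `(follower, followed)` pair format are assumptions made for illustration only.

```python
from collections import Counter


def degree_counts(follows):
    """Compute in-, out-, and total-degree per user.

    follows: iterable of (follower, followed) pairs, i.e. A -> B means
    A follows B (hypothetical input format).
    """
    out_deg = Counter()
    in_deg = Counter()
    for follower, followed in follows:
        out_deg[follower] += 1  # follower gains one outgoing link
        in_deg[followed] += 1   # followed user gains one incoming link
    users = set(out_deg) | set(in_deg)
    return {
        u: {"in": in_deg[u], "out": out_deg[u], "total": in_deg[u] + out_deg[u]}
        for u in users
    }
```

For example, with the links A → B, C → B, and B → A, user B has an in-degree of 2, an out-degree of 1, and a total-degree of 3.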
Relationships between users were mapped to allow for social network analysis and to determine network structure. The relationships were then filtered to include only those within the core group. Using these relationships, a core weight-talk network and a core fitness-talk network were mapped. Additionally, to determine whether the weight-talk and fitness-talk communities overlap, a combined network was mapped and its metrics were compared with those of the two separate networks.
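The filtering step could look like the sketch below. The function names are hypothetical, and building the combined network as the union of the two edge sets is an assumption, since the construction of the combined network is not specified here.

```python
def core_network(follows, core_users):
    """Keep only follow links whose two endpoints are both core users.

    follows: iterable of (follower, followed) pairs (hypothetical format).
    """
    core = set(core_users)
    return {(a, b) for a, b in follows if a in core and b in core}


def combined_network(weight_edges, fitness_edges):
    """Assumption: the combined network is the union of both edge sets."""
    return set(weight_edges) | set(fitness_edges)
```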
To answer questions regarding structure and overlap (objective 2), four standard network metrics were used: density, average path length, mean total-degree, and clustering coefficient. Density measures how many connections exist in the network and ranges from zero (no connections) to one (all possible connections exist). Average path length is the mean length of the shortest paths between pairs of users and is important because it indicates how far, on average, information has to travel to get from one user to another. Mean total-degree is the average number of connections per user and is useful for determining how well connected a network is. Finally, the clustering coefficient measures the extent to which users in a network tend to cluster together, which helps characterize the network structure in terms of its communities.
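A minimal sketch of the four metrics, using only the Python standard library, is given below. Several choices are assumptions not stated in the text: density and path length treat links as directed, path length is averaged over reachable ordered pairs only, and the clustering coefficient is the average local coefficient computed on the undirected version of the network. The function name `network_metrics` is hypothetical.

```python
from collections import deque
from itertools import combinations


def network_metrics(edges):
    """edges: iterable of directed (follower, followed) pairs."""
    edges = {(u, v) for u, v in edges if u != v}  # drop self-loops, duplicates
    nodes = {u for e in edges for u in e}
    n, m = len(nodes), len(edges)
    if n < 2:
        raise ValueError("need at least one link between two users")

    succ = {u: set() for u in nodes}  # directed successors
    nbrs = {u: set() for u in nodes}  # undirected neighbours
    for u, v in edges:
        succ[u].add(v)
        nbrs[u].add(v)
        nbrs[v].add(u)

    density = m / (n * (n - 1))       # share of possible directed links
    mean_total_degree = 2 * m / n     # each link adds one in and one out

    # Average shortest path length via breadth-first search from each user.
    total, pairs = 0, 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1        # reachable targets, excluding s itself
    avg_path = total / pairs if pairs else float("nan")

    # Average local clustering coefficient on the undirected network.
    local = []
    for u in nodes:
        k = len(nbrs[u])
        if k < 2:
            local.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs[u], 2) if b in nbrs[a])
        local.append(2 * links / (k * (k - 1)))
    clustering = sum(local) / n

    return {
        "density": density,
        "average_path_length": avg_path,
        "mean_total_degree": mean_total_degree,
        "clustering_coefficient": clustering,
    }
```

For a three-user cycle A → B → C → A, this sketch gives a density of 0.5, an average path length of 1.5, a mean total-degree of 2, and a clustering coefficient of 1.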
Additionally, modularity was calculated for each network in order to identify communities based on the similarity of users’ connections. This was done using the “fast unfolding of communities in large networks” algorithm [8], which decomposes a network into sub-units or communities, that is, sets of highly inter-connected users. The four main communities of each network were extracted. Hubs were found by ranking users by their in-degree. This allowed for a linguistic corpus analysis based on the communities of each network.
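The sketch below does not implement the community detection algorithm of [8] itself. It only evaluates the modularity score of a given partition, which is the quantity that kind of algorithm seeks to maximize, and ranks hubs by in-degree. Treating links as undirected and unweighted for the modularity score is an assumption, as are the function names and input formats.

```python
from collections import Counter


def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted network.

    edges: list of (u, v) pairs, each link listed once.
    community: dict mapping every node to a community label.
    """
    m = len(edges)
    degree = Counter()
    internal = Counter()  # links with both endpoints in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1

    comm_degree = Counter()  # summed degree of each community
    for node, k in degree.items():
        comm_degree[community[node]] += k

    # Observed share of internal links minus the share expected at random.
    return sum(
        internal[c] / m - (comm_degree[c] / (2 * m)) ** 2 for c in comm_degree
    )


def top_hubs(follows, community, label, k=3):
    """Rank users of one community by in-degree (number of followers).

    follows: iterable of (follower, followed) pairs.
    """
    in_deg = Counter(
        followed for _, followed in follows if community[followed] == label
    )
    return in_deg.most_common(k)
```

A partition that matches the densely connected groups scores higher than one that places all users in a single community, for which the score is zero.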
Linguistic corpus
The linguistic corpus is the whole set of text data (tweets) to be analyzed. Simple Natural Language Processing (NLP) techniques, such as the division of sentences into individual words (tokenization) and the subsequent analysis of those words in pairs or triplets (n-grams), were used to analyze these data. Latent Dirichlet Allocation (LDA), which explains groups of observations (words) in terms of unobserved groups (themes), was also used. A total of 3,772,507 tweets were collected. Of those, non-English-language tweets (n=510,145) and tweets from users who were in neither the core nor the visiting category (n=1,291,155) were removed. As a result, the final corpus was reduced to 1,971,207 English-language tweets.
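Tokenization and n-gram counting can be illustrated with a short sketch. The regular expression used to split words and the function names are assumptions; the tokenizer actually used is not specified here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (simplified rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tweets, n, k):
    """Count n-grams across a list of tweets and return the k most common."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    return counts.most_common(k)
```

With `n=2` the function counts word pairs and with `n=3` word triplets.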
Linguistic n-grams
To confirm that the communicational content of the weight-talk and fitness-talk networks is explicitly negative or weight-loss related (objective 3), the tweets were divided according to the four main communities of each core social network. A Latent Dirichlet Allocation (LDA) model was then applied to each community to differentiate individual unobserved clusters of words, which allowed for a qualitative lexical text analysis and the extraction of common themes within the communities.
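To show how an LDA model groups words into unobserved themes, a toy sketch based on collapsed Gibbs sampling is given below. It is not the implementation used in the study: the inference method, the hyperparameters `alpha` and `beta`, the number of iterations, and the function names are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model by collapsed Gibbs sampling.

    docs: list of token lists (e.g. one tokenized tweet per entry).
    Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, k=5):
    """List the k most frequent words assigned to each topic."""
    return [
        sorted(counts, key=counts.get, reverse=True)[:k] for counts in topic_word
    ]
```

The word lists returned by `top_words` are the kind of output that would then be read qualitatively to name the themes of each community.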