Understanding Vaccine Hesitancy with Application of Latent Dirichlet Allocation to Reddit Corpora

This research paper explores the underlying factors that contribute to vaccine hesitancy, resistance, and refusal. Using Latent Dirichlet Allocation (LDA), an unsupervised generative probabilistic model, we generated latent topics from user-generated Reddit corpora on reasons for vaccine hesitancy. Although we hoped to explore the grounds for vaccine hesitancy across the globe, our findings suggest that the corpus used for analysis had been generated by users living predominantly in the United States. Observation of the topics generated by the LDA model led to the discovery of the following latent factors: (i) fear of risks and side effects; (ii) lack of trust in policymakers; (iii) religious belief; (iv) mass surveillance theories; (v) perception of vaccination as a precedent to totalitarianism; (vi) racial background, pertaining to retrospective events of racial injustice such as selective sterilization; (vii) a depopulation agenda fueled by theories affiliated with global warming and Extinction Rebellion; and (viii) perception of vaccination as a campaign to quell immigrant population growth, fueled by reports of coerced sterilization of immigrants in ICE detention.


INTRODUCTION
Identifying and understanding coronavirus vaccine hesitancy within distinct populations is a hard task that requires considerable experience in the fields of psychology and human behavior. However, research in this area may aid future public health messaging. Hesitancy and resistance toward vaccines have been the subject of various studies in the past [1], [2], [3] and in the recent times of the coronavirus pandemic [4]. Surveys have been the preferred method to observe and discover the factors that contribute to vaccine-hesitant behavior among populations living within specified geographic locations. Using surveys, hypotheses are tested through analysis of participant responses. However, discovering the latent contributors to an event or an outcome through analysis of survey responses is difficult, if at all possible. In this research, we explore the underlying factors of vaccine hesitancy through application of Latent Dirichlet Allocation to user-generated Reddit corpora on vaccine hesitancy and refusal.
The Internet, being a virtual cosmopolitan society, aids and simplifies information retrieval from populations of diverse socio-demographic backgrounds, and recent advancements in the fields of computational linguistics and Natural Language Processing favor a computerized approach [5] to analyzing the massive data that remains available at large. Latent Dirichlet Allocation [6] is a hierarchical Bayesian topic model that generates topic keywords from a text corpus with both efficiency and reduced complexity [7]. Moreover, the LDA model allows for the possibility that a document contains multiple topics, whereas unigram models assume that a given document has no more than a single topic. Formally, the Latent Dirichlet Allocation model assumes a collection of K "topics." Each topic defines a multinomial distribution over the vocabulary and is assumed to have been drawn from a Dirichlet distribution, β_k ∼ Dirichlet(η). Given the topics, LDA assumes the following generative process for each document d. First, the model draws a distribution over topics, θ_d ∼ Dirichlet(α). Second, for each word i in the document, the model draws a topic index z_{d,i} ∈ {1, ..., K} from the topic weights, z_{d,i} ∼ θ_d, and draws the observed word w_{d,i} from the selected topic, w_{d,i} ∼ β_{z_{d,i}}. For simplicity, symmetric priors are assumed on θ and β, but this assumption is easily relaxed [8].
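The generative process described above can be sketched in a few lines of Python. The topic count K, vocabulary size V, document length, and prior values below are arbitrary illustrative choices, not the settings used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3          # number of topics (illustrative)
V = 10         # vocabulary size (word indices 0..V-1)
alpha = 0.5    # symmetric document-topic prior
eta = 0.1      # symmetric topic-word prior
doc_len = 20

# Each topic beta_k is a distribution over the vocabulary,
# drawn from Dirichlet(eta).
beta = rng.dirichlet([eta] * V, size=K)      # shape (K, V)

# Generative process for one document d:
theta_d = rng.dirichlet([alpha] * K)         # topic proportions theta_d
words = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta_d)             # draw topic index z_{d,i}
    w = rng.choice(V, p=beta[z])             # draw word w_{d,i} from beta_z
    words.append(w)

print(words)
```

Inference then inverts this process: given only the observed words, LDA estimates the hidden θ and β that most plausibly generated them.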
Thus, in simple terms, LDA helps explain the similarity of data by clustering features of the data into unobserved sets; a combination of these sets then constitutes the observable data. The method can be applied to various tasks including, but not limited to, topic identification [9], entity resolution [10], and Web spam classification [11].

RELATED WORK
According to a Canadian survey, although only 3 percent of parents refused all vaccines for their children, 19 percent considered themselves vaccine hesitant [12]. Vaccine-hesitant parents are a larger and more attentive group than vaccine refusers [13], [14]. Sixty-three percent of Canadian parents look for information about immunization on the Internet; of these, close to half perform a Google search [15]. A large number of anti-vaccine websites exist that propagate a range of anti-vaccine messages [16]. Much of the existing literature on vaccine resistance and hesitancy focuses primarily on the explicit reasons why individuals choose not to get a particular vaccine or defy vaccination programmes in general [17], [18], [19], [20]. Surveys have remained the preferred methodology to assess the underlying factors that contribute to vaccine resistance. However, exploratory analysis of opinionated text using Natural Language Processing techniques widens the horizon, leading to the identification of latent factors that are less noticeable to the naked eye [21]. Analysis of lexical bundles to observe word combinations or co-occurrences of words, also referred to as "collocation" or "collocability" [22], has been used successfully in the past for information retrieval from text corpora [23], [24], [25]. In the context of machine learning and translation, lexical bundles or collocations are referred to as n-grams or multi-word expressions (MWEs) [26] and are used in the weighting of topic models in mixture language model adaptation [27]. Internet web forums and social media platforms are a major source of user-generated text data, which, when properly analyzed, can reveal latent, underlying factors that are otherwise obscure.
Although the majority of mainstream social media platforms censor controversial information related to vaccines, Reddit neither censors nor shuns users for holding unpopular opinions about vaccines. A goldmine of information, both bizarre and useful, can be found on the platform about vaccines and many other controversial topics. Unlike other social media platforms that rely on individuals connecting and interacting with people they know in the offline world, Reddit connects people based on the things they care about [28]. This feature lets like-minded people discuss anonymously the things they care about, which they cannot do in real life without being "cancelled" or "ostracized" for holding an unpopular opinion. Anti-vaccine discussions are rampant on Reddit, with active subreddits dedicated to bringing vaccine-hesitant people together from across the globe. Moreover, controversial information spreads faster and further than non-controversial information on Reddit [29], attracting a wide variety of comments from users of diverse backgrounds.

METHODS
This section discusses the methods used for the collection, processing, and analysis of the corpus; we used standard Python libraries throughout.

Data retrieval
We collected comments from subreddits (r/askreddit, r/antivax, r/antivaccine, r/AntiVaxxers) that specifically discussed "the reasons not to get the vaccine." The data were retrieved from Reddit using PRAW (Python Reddit API Wrapper), a Python package that provides access to Reddit's Application Programming Interface [30]. The text data from the Reddit API were retrieved into four documents, namely documents 1 to 4, which formed the input for the LDA model. The unstructured data contained the headlines or titles of the posts, the comments, and other metadata, namely the timestamp and the username. All fields except the comments were dropped while processing the corpus.
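A minimal sketch of this retrieval step with PRAW is shown below. The helper names (`fetch_comments`, `keep_comment`), the credential placeholders, and the filtering of deleted comments are our illustrative assumptions, not the paper's actual code:

```python
# Usage (requires PRAW and API credentials registered at reddit.com/prefs/apps):
#   import praw
#   reddit = praw.Reddit(client_id="CLIENT_ID",
#                        client_secret="CLIENT_SECRET",
#                        user_agent="vaccine-hesitancy-study")
#   document = fetch_comments(reddit, "<thread hyperlink>")

def keep_comment(body):
    """Retain only genuine comment text; placeholders for deleted or
    removed comments are dropped (an assumed cleanup step)."""
    return body not in ("[deleted]", "[removed]")

def fetch_comments(reddit, url):
    """Retrieve all comment bodies from a Reddit submission hyperlink,
    discarding titles, timestamps, and usernames as in the paper."""
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=None)  # expand "load more comments"
    return [c.body for c in submission.comments.list() if keep_comment(c.body)]
```

Running `fetch_comments` once per thread yields the four comment documents that serve as input to the LDA model.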

Data processing
The corpus was normalized: the strings were split into tokens; letters were converted from uppercase to lowercase; punctuation, accent marks, and other diacritics were stripped; and stopwords were removed. In addition to the standard stopwords of the Natural Language Toolkit, we stripped the words "vaccine," "coronavirus," "covid," "covid19," "pandemic," "pfizer," "johnson," and "astrazeneca." Our initial observation of the corpus using a word cloud showed that these words constituted a major part of the corpus and would amount to "collection words," although we did not use any collection words or query search to collect comments from Reddit's API; we used hyperlinks instead. We neither stemmed nor lemmatized the corpus, as our initial observations indicated that lemmatization altered the context of some words that we considered important for model building. To avoid losing information, we used an unlemmatized corpus for analysis.
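The normalization steps above can be sketched as follows. This standard-library version stands in for the NLTK pipeline; the `STOPWORDS` set shows only a small illustrative subset of NLTK's English stopword list, while `COLLECTION_WORDS` matches the domain terms stripped in this study:

```python
import re
import unicodedata

# Small illustrative subset of NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "it", "i"}

# Domain-specific "collection words" stripped in addition to stopwords.
COLLECTION_WORDS = {"vaccine", "coronavirus", "covid", "covid19",
                    "pandemic", "pfizer", "johnson", "astrazeneca"}

def normalize(text):
    """Lowercase, strip accents/diacritics and punctuation, tokenize,
    and drop stopwords and collection words. No stemming or
    lemmatization is applied, mirroring the paper's pipeline."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens
            if t not in STOPWORDS and t not in COLLECTION_WORDS]

print(normalize("The Pfizer vaccine caused Séverè side-effects!"))
# -> ['caused', 'severe', 'side', 'effects']
```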

Hyperparameter optimization
The parameters of the priors are called hyperparameters. In LDA, the distributions of topics over documents and of words over topics have priors that are conventionally denoted alpha and beta, respectively. The alpha hyperparameter specifies prior beliefs about topic sparsity or uniformity in the documents, and the beta hyperparameter controls the distribution of words per topic. Different packages use different notations for these hyperparameters; in Gensim, they are denoted alpha and eta. By default, Gensim uses a fixed symmetric prior of 1/(number of topics) per topic. We conducted a series of sensitivity tests to determine the Dirichlet alpha and eta hyperparameters, using both the default values of the Gensim library and custom values, for both the standard Latent Dirichlet Allocation model and the Machine Learning for Language Toolkit (MALLET) model, with different coherence metrics as discussed in the following section.

Coherence measures
Topic coherence measures score a single topic by measuring the degree of semantic similarity between high-scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference. For our evaluation, we consider (i) the UCI measure [31] and (ii) the UMass measure [32], both of which have been shown to match well with human judgements of topic quality. These measures compute the coherence of a topic as the sum of pairwise distributional similarity scores over the set of topic words V. This has been generalized as coherence(V) = Σ_{(v_i, v_j) ∈ V} score(v_i, v_j, ε), where V is a set of words describing the topic and ε indicates a smoothing factor which guarantees that the score returns real numbers. The UCI metric defines a word pair's score to be the pointwise mutual information (PMI) between the two words, i.e., score(v_i, v_j, ε) = log [(p(v_i, v_j) + ε) / (p(v_i) p(v_j))]. The probabilities of words are computed by counting the co-occurrence frequencies of words in a sliding window over an external corpus, such as Wikipedia. To some extent, this metric can be thought of as an external comparison to known semantic evaluations. The UMass metric, on the other hand, defines the score based on document co-occurrence: score(v_i, v_j, ε) = log [(D(v_i, v_j) + ε) / D(v_i)], where D(x, y) counts the number of documents containing both words x and y, and D(x) counts the number of documents containing x. More importantly, the UMass metric computes these counts over the original corpus used to train the topic models, rather than an external corpus. This metric is more intrinsic in nature; it attempts to confirm that the models learned data known to be in the corpus.
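The UMass measure is simple enough to sketch directly; the function name `umass_coherence` and the tiny corpus below are illustrative, not part of the paper's code:

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents, eps=1.0):
    """UMass coherence of one topic: sum over word pairs (v_i earlier,
    v_j later in the top-word list) of log((D(v_i, v_j) + eps) / D(v_i)),
    where D counts document co-occurrences in the training corpus itself."""
    def D(*words):
        return sum(all(w in doc for w in words) for doc in documents)
    total = 0.0
    for i, j in combinations(range(len(topic_words)), 2):
        v_i, v_j = topic_words[i], topic_words[j]
        total += math.log((D(v_i, v_j) + eps) / D(v_i))
    return total

# Tiny illustrative corpus of tokenized "documents".
docs = [["risk", "side", "effect"], ["risk", "side"], ["trust", "risk"]]
print(umass_coherence(["risk", "side"], docs))  # log((2+1)/3) = 0.0
```

Values closer to zero indicate that the topic's top words frequently co-occur in the training documents, i.e., a more coherent topic.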

RESULTS
The properties of the retrieved corpus before processing are summarized in Table 1. Bigrams and trigrams were extracted from the corpus using the Natural Language Toolkit. An analysis of the extracted bigrams showed a tight interconnection between the bigram components: most of the bigrams were stable phrases, such as [eugenics, board] and [coerced, sterilization]. A representative sample of the identified bigrams from the corpus is given in Table 2. Similar to the bigrams, the trigrams identified in the corpus showed a tight interconnection between their components, and most were stable phrases as well, as shown in Table 3.
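Collocation extraction of this kind can be sketched with NLTK's collocation finders. The toy token stream below merely echoes bigrams of the sort reported in Table 2; it is not the study's corpus, and the frequency filter and PMI ranking are illustrative choices:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token stream echoing bigrams of the kind reported in Table 2.
tokens = ("coerced sterilization of immigrants coerced sterilization "
          "records eugenics board state eugenics board").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only bigrams occurring at least twice
stable = finder.nbest(BigramAssocMeasures.pmi, 5)  # rank by PMI
print(stable)
```

Bigrams that survive the frequency filter and score highly on PMI are exactly the "stable phrases" described above: word pairs that recur together far more often than chance would predict.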

Results of Hyperparameter Optimization
We tested the standard Latent Dirichlet Allocation model and the Machine Learning for Language Toolkit (MALLET) model for different values of alpha [symmetric, auto, 0.5] while keeping eta at 0.01 [η = 0.01] for all implementations. The symmetric alpha for standard LDA is computed by dividing 1.0 by the total number of topics the model takes as input, while the symmetric alpha for MALLET LDA is computed by dividing 5.0 by the total number of input topics [33]. The results are given in Table 4. Further, as can be seen in Table 5, Gensim generated a different "auto" alpha value for each topic of the standard LDA model, with a mean of 0.2733 and a standard deviation of 0.0901. Likewise, the mean alpha of the MALLET LDA was 0.26225, with a standard deviation of 0.142. We tested our LDA models for different hyperparameter

values; however, we chose "auto" alpha over symmetric alpha because the latter may reduce the number of very small, poorly estimated topics but may disperse common words over several topics. In addition, rather than deciding on fixed hyperparameters for the entire collection (with each topic having a similar probability in the model, and each word having a similar probability in each topic), it makes much more sense to allow for some differentiation between overall topic probabilities in a model: after all, it makes perfect sense that some topics are more general and therefore widespread, while others are more specific and therefore less common [11]. This intuition is implemented in the hyperparameter optimization function of MALLET [34]. Table 4 shows the coherence by number of topics for the standard LDA and Machine Learning for Language Toolkit models evaluated using the C_v and UMass metrics. We tested the models for different values of k between 1 and 25, while the hyperparameters alpha and eta were set to their defaults. We observed that the graphs of the standard and MALLET LDA models evaluated using the C_v metric were quite similar, and the graphs of the standard and MALLET LDA models evaluated using the UMass metric were likewise similar to each other, as shown in Figures 1 and 2. In the C_v metric, the maximum value indicates the optimal topic coherence [35], while in the case of the UMass metric, the value closest to zero indicates the highest coherence [35]. The highest coherence value estimated by the standard LDA model using the C_v metric was 0.717, for k = 7 topics. Likewise, the highest coherence value evaluated by the Machine Learning for Language Toolkit model using the C_v metric was 0.720, for k = 8 topics. In contrast, the value closest to zero in the list of coherence values generated by the standard LDA model using the UMass metric was 0.242, and the corresponding number of optimal topics suggested by the model was 10 [k = 10].
The value closest to zero in the list of coherence values generated by the MALLET LDA model was 0.018, for k = 8 topics. Figures 1 and 2 show how the coherence values vary for different values of k [between 1 and 25]. We chose k = 8 as the optimal input for our LDA topic models based on our observations from hyperparameter optimization and coherence evaluation. Using the above criteria, we built a standard LDA model and a Machine Learning for Language Toolkit model [both using C_v as the coherence metric] to predict the k topics and their corresponding word probabilities from our tokenized corpus. The results are discussed in the following section.

Evaluation of generated topics
The properties of our topic models were as follows: number of topics, k = 8; hyperparameters [alpha and eta] set as default/auto; and coherence metric set as C_v. Topics generated by the standard LDA model are given in Table 6, and topics generated by the Machine Learning for Language Toolkit model are given in Table 7. Close observation of the results indicated that the MALLET LDA outperformed the standard LDA in generating topics from the corpus. The standard LDA model, despite a high coherence [coherence(C_v) = 0.717], did not generate coherent topics, except for three, as shown in Table 6. The topics we observed to be coherent were fear of risks and side effects, lack of trust in policymakers, and topics related to Evangelicalism. The words in Topic 1 can collectively be classified as "Fear of Risks and Side Effects." Similarly, the words observed in Topics 6 and 7 can collectively be categorized as "Lack of Trust in Policymakers" and "Related to Evangelicalism," respectively. Close observation of the other topics indicates that some are partially coherent, while others are erratic, with words mixed up and no discernible coherence at all. In contrast, the Machine Learning for Language Toolkit model did a fair job of generating topics from our corpus, as shown in Table 7.
We labeled the topics as shown in Table 7. Although a few unrelated words were observed in Topics 7 and 8, the majority of the words indicate that these topics are related to racial issues and immigration, respectively. Both the standard and MALLET LDA models generated topics related to "risks and side effects," "lack of trust in the policymakers," and "Evangelicalism." However, the results of the standard LDA model indicate that words are mixed up except for three topics, and the output becomes erratic toward the end. Observation of the bigrams and trigrams indicates that certain words coexist in the corpus, such as "immigration" and "sterilization," which together form phrases and sentences about the sterilization of immigrants in ICE detention. Although sterilization and immigration are distinct topics, their frequent coexistence in the corpus may have influenced the output generated by the standard LDA model. In contrast, the topics generated by the Machine Learning for Language Toolkit model [MALLET] are less erratic and more precise, leading to the discovery of eight latent topics from the tokenized corpus.