Joint Topical Word Embedding for Detecting Drift in Social Media Text

Social media texts like tweets and blogs are collaboratively created by human interaction. Fast-changing trends lead to topic drift in social media text. This drift is usually associated with words and hashtags. However, geotags play an important part in determining topic distribution with location context. The rates of change in the distributions of words, hashtags and geotags cannot be considered uniform and must be handled accordingly. This paper builds a topic model that associates a topic with a mixture of distributions of words, hashtags and geotags. A stochastic gradient Langevin dynamic model with varying mini-batch sizes is used to capture the changes due to the asynchronous distributions of words and tags. Topical word embeddings with co-occurrence and location contexts are specified as hashtag context vectors and geotag context vectors respectively. These two vectors are jointly learned to yield topical word embedding vectors related to the tags context. Topical word embeddings over time conditioned on hashtags and geotags predict location-based topical variations effectively. When evaluated with Chennai and UK geolocated Twitter data, the proposed joint topical word embedding model enhanced by the social tags context outperforms other methods. The average perplexity and relative perplexity (Knights et al. 2009) over snapshots are computed for evaluating dynamic topic models. Topical clustering is compared for evaluating different topical embedding models (Niu et al.). Topic drift is evaluated using KL divergence (Cai et al. 2014) and MSD (Yang & Donnat 2017) measures.


INTRODUCTION
Topics identified in a document give meaning to groups of words. Topic drift is different from semantic drift or semantic change in that the first is related to the change in the group-wise distribution of words over time, whereas the second reflects the change in the usage of words over time. Tweets and microblogs have diverse contents like text, URLs, hashtags, geotags, mentions, images, emoticons and time stamps. User groups and user interests are connected with locations and have faster dynamics based on location specific events. Word and hashtag distributions in tweets fluctuate at regular intervals during festivals (such as Diwali and Christmas) and sports events (such as the Olympics). These changes may or may not lead to topic drift. Hence, topic drift in tweets is a complex issue and must be approached from a different perspective.
From the literature, it is evident that topic drift can be examined by modeling time with word co-occurrence patterns (Wang & McCallum 2006), identifying topic boundaries (Liu et al. 2013), detecting subtopics (Fei et al. 2015), quantifying the impact of a topic on a location (Bernabe-Moreno et al. 2015) and representing the context as a cluster of hashtags (Alam et al. 2017). Social media text reflects cultural change in the social environment that leads to topic drift, where location plays a major role. Geotagged tweets reflect the behavior of people in that region. Location proximity relates messages with the same event in addition to time (Atefeh & Khreich 2015) and is used for detecting location based topical variation. Zhang et al. (2015) combined clustering and a topic model to study topic clusters of documents from a geolocation. The most commonly used method for modelling topics is latent Dirichlet allocation (LDA), which represents a discrete unstructured text document as random mixtures over latent topics and a topic as a distribution of words (Blei et al. 2003).

The contents of social media text evolve at a tremendously high rate. Hence, a suitable dynamic modeling approach is required for capturing the topic dynamics from a large number of tweets that are partitioned across either discrete or continuous time slots. The dynamic topic models (DTM) described using variational Kalman filtering and non-parametric wavelet regression (Blei & Lafferty 2006), online inference using a stochastic EM algorithm (Iwata et al. 2010), Gibbs sampling with stochastic gradient Langevin dynamics (Bhadury et al. 2016) and the continuous time dynamic topic model using Brownian motion (Wang et al. 2006) illustrate how the posterior inference of topics over time can be performed effectively.
Continuous time dynamic topic models are suitable for generating topics from a sequential collection of documents (Wang et al. 2006) but may not be appropriate for tweets, because Twitter APIs cannot be used to find tweets older than a week. In addition, a topic in tweets depends on multiple attributes with varying dynamic distributions. Hence, the dynamic topic model used for Twitter data should be discrete (Bhadury et al. 2016), capable of integrating the varying dynamic behavior of words, tags, URLs and mentions over time, and should incorporate additional contexts related to the mixture of distributions of words, tags, URLs, mentions and geolocations. Hashtags are labels for tweets that share a common topic, generated by internet users to categorize concepts. Hashtags are associated with the co-occurrence context. A geotag is metadata that describes the geographic location of a tweet and is connected with location contexts. These tags enable high quality topical word embeddings enhanced with hashtags and geolocations. These embeddings can be learned incrementally over time for investigating topic drift in tweets.
The proposed model includes a tags context-based topic model to generate the topic distribution over words conditioned on hashtags and geotags, and a discrete time stochastic gradient Langevin dynamic model with varying mini-batch sizes for capturing the changes in the asynchronous distributions of words, hashtags and geotags. Another contribution of this paper is a topical word embedding model that enhances the proposed topic model to yield topical word embedding vectors trained on contexts related to hashtags and geotags.

RELATED WORK
The methods used for analyzing topic drift, existing topic models for Twitter data, the available discrete time dynamic topic models and the use of topical word embedding are presented.

Topic Drift in Twitter Data
Cataldi et al. (2010) extracted terms from tweets, estimated user authority based on a directed graph of active authors, computed the 'term life cycle' using aging theory and selected emerging terms based on their age. They constructed a 'topic graph' that links emerging terms with their co-occurring terms. An emerging term connected with semantically related words leads to a subgraph of the topic graph showing drift in topics. Saha & Sindhwani (2012) employed online nonnegative matrix factorization (NMF) to generate topics from streaming text like blogs and tweets, and formulated a temporal regularization operator for topic evolution. Fei et al. (2015) proposed a 'cluster based subtopic detection algorithm' to cluster tweets into subtopics for examining topic drift. However, hashtags are not considered in analyzing the drift. Rosa et al. (2011) performed supervised topical clustering of tweets into predefined categories using hashtags as topic indicators. After clustering at coarse and fine levels, they observed the difference between clusters in training and testing data. Alam et al. (2017) represented topic as 'word distribution' and context as 'hashtags distribution' and studied the context over time with continuous and discrete time distributions. The time attribute for topic evolution can be directly added to the topic model (Wang & McCallum 2006) for continuous time distribution. Liu et al. (2013) employed LDA with Gibbs sampling for extracting topic words and measured the coherence of topical content as the change in alignment of the topic word sequence over time. However, other topic indicators such as URLs, geolocations and mentions have not been considered. The location context has an impact on topic distribution comparable to that of the co-occurrence context with hashtags.
Bernabe-Moreno et al. (2015) studied user interactions with topics and investigated the impact of a topic on a location over a period of time in tweets. However, each location has its own influence on the topic distribution in tweets, and the location attribute should be integrated into the topic model for extracting topics from tweets.

Extension of Topic Models for Tweets
Labeled LDA (Ramage et al. 2009, 2010) categorizes Twitter content into four types of dimensions, namely substance, social, style and status, using hashtags, mentions and emoticons in tweets. Zhao et al. (2011) proposed Twitter LDA for generating topics from tweets by distinguishing the word distributions as 'general words' and 'topic words', and categorized topics as longstanding, entity-oriented and event-oriented for opinion mining. Topic models for examining topics over time in tweets must integrate the topic's distribution over words, hashtags and locations to track the influence of co-occurrence as well as location contexts. A dynamic mechanism can be incorporated into the topic model to study topic drift in tweets that are partitioned into different snapshots over time.

Tracking Dynamic Topics
Traditional time series models are suitable for continuous data, whereas topic models for tweets should be able to process the varying dynamic behavior of hashtags and geotags in addition to words.

Topical Word Embedding
Topic models can be enriched using a WE model. The probability distribution of word 'w' in document 'D' sampled under topic 'z' is described as p(w_n \mid z_n, \beta), a multinomial over the vocabulary. Gibbs sampling is suitable for generating posterior probabilities by sampling variables from the conditional distribution. With LDA, the joint probability of the topic proportion 'θ', a set of 'k' topics 'z' and a set of 'Nd' words is given (Blei et al. 2003) as

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N_d} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (2)
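To make the inference step concrete, here is a minimal collapsed Gibbs sampler for plain LDA; it is a sketch of the standard algorithm, not the authors' code, and the corpus format (lists of word ids), function name and hyperparameter defaults are assumptions.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (sketch).
    docs: list of lists of word ids in [0, V); K: number of topics."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # per-topic totals
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):   # initialize counts from random assignments
        for n, w in enumerate(doc):
            k = z[d][n]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]          # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z = k | rest) up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k          # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nk[:, None] + V * beta)  # topic-word distributions
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)
    return phi, theta
```

Each sweep resamples every token's topic given all other assignments, which is exactly the "resample each random variable given the remaining variables" scheme described above.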

LDA_Tags
The proposed LDA_Tags differs from the topic model described by Rosa et al. (2011) in that the context of both hashtags and geotags is incorporated into the topic model for the analysis of topic drift in social media text. Hashtags 'y' and geotags 'x' are distributed over topics based on the Dirichlet distribution (θ), and topic 'z' has a multinomial distribution (Φ) over words.

Figure 3. LDA_Tags
Now the document is modeled as a mixture of topics (z) and the topics as a mixture of distributions over the words (w), hashtags (hv) and geotags (gv). With LDA_Tags, the probability distribution of word 'w', hashtag 'y' and geotag 'x' in document 'D' sampled under the topic 'z' are described as done by Blei et al. (2003) for words.

Algorithm 2. LDA_Tags
Vocabularies of words (W), hashtags (hv), geotags (gv) and hyperparameters (α, β) define the topic model. The topic proportion is sampled once per document, whereas a hashtag 'y', geotag 'x', topic 'z' and word 'w' are sampled for each word position in the document 'd' (Algorithm 2). Now, the joint probability of the topic proportion (θ) with a set of 'k' topics and a set of N_d words, hashtags and geotags for the given hyperparameters (α, β) is defined by modifying equation (2) as

p(\theta, \mathbf{z}, \mathbf{w}, \mathbf{y}, \mathbf{x} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N_d} p(z_n \mid \theta)\, p(w_n \mid z_n, \Phi)\, p(y_n \mid z_n, \Psi)\, p(x_n \mid z_n, \Omega)

where hv is the hashtag vocabulary, gv the geotag vocabulary, Ψ the per-topic hashtag distribution and Ω the per-topic geotag distribution.
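To illustrate Algorithm 2, the following sketch generates one document under the LDA_Tags generative story; the symbols Φ, Ψ and Ω follow the reconstruction above, while the function name and array layout are assumptions for illustration.

```python
import numpy as np

def generate_doc(n_words, K, Phi, Psi, Omega, alpha=0.1, rng=None):
    """Generative sketch for LDA_Tags: each token carries a word, a hashtag
    and a geotag, all drawn from the same sampled topic.
    Phi: K x |W| word dist., Psi: K x |hv| hashtag dist., Omega: K x |gv| geotag dist."""
    rng = rng or np.random.default_rng()
    theta = rng.dirichlet(alpha * np.ones(K))       # per-document topic proportion
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                  # sample a topic
        w = rng.choice(Phi.shape[1], p=Phi[z])      # sample a word under z
        y = rng.choice(Psi.shape[1], p=Psi[z])      # sample a hashtag under z
        x = rng.choice(Omega.shape[1], p=Omega[z])  # sample a geotag under z
        doc.append((w, y, x, z))
    return doc
```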
In Twitter data, hashtags have faster dynamics compared to those of words and geotags. New hashtags are created every day and their distributions are unpredictable. Hence, varying granularities must be handled when examining the distributions of words, hashtags and geotags.

Topic Dynamics with Hashtag and Geotag Contexts
The proposed dynamic topic model is based on a stochastic process that represents the system with random variables whose probability distributions change randomly over time.
The proposed dynamic topic model is described using stochastic gradient Langevin dynamics (SGLD), an incremental gradient method that minimizes an objective function by running through a subset of samples. The SGLD update for the parameter θ at the t-th iteration is given (Welling & Teh 2011) as

\Delta\theta_t = \frac{\varepsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{m}\sum_{i=1}^{m} \nabla \log p(x_{t_i} \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \varepsilon_t) \qquad (10)

where the stochastic gradient (SG) term is \frac{N}{m}\sum_{i=1}^{m} \nabla \log p(x_{t_i} \mid \theta_t), the Langevin dynamics (LD) noise is \eta_t \sim \mathcal{N}(0, \varepsilon_t), N is the number of data items, m the mini-batch size, x_{t_i} the mini-batch data items and \varepsilon_t the step size or learning rate. The dynamic parameters are inferred using methods like Gibbs sampling, which resamples each random variable iteratively given the remaining variables. However, different mini-batch sizes of words, hashtags and geotags are required for different snapshots.
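A minimal numpy sketch of the SGLD update in equation (10); the gradient callbacks and the way the mini-batch is passed in are assumptions, not the paper's implementation.

```python
import numpy as np

def sgld_step(theta, minibatch, N, eps_t, grad_log_prior, grad_log_lik, rng):
    """One SGLD update (Welling & Teh 2011): half step along the stochastic
    gradient plus injected Langevin noise.
    N: total number of data items; minibatch: m sampled items; eps_t: step size."""
    m = len(minibatch)
    # stochastic gradient: prior term + rescaled mini-batch likelihood term
    sg = grad_log_prior(theta) + (N / m) * sum(grad_log_lik(x, theta) for x in minibatch)
    # Langevin noise eta_t ~ N(0, eps_t), i.e. standard deviation sqrt(eps_t)
    noise = rng.normal(0.0, np.sqrt(eps_t), size=theta.shape)
    return theta + 0.5 * eps_t * sg + noise
```

Decreasing eps_t over iterations trades gradient noise against Langevin noise, which is what lets the iterates approximate posterior samples rather than a point estimate.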

Varying Mini-batch Size
The dynamic topic model for LDA_Tags must handle the difference in the rates of change of the distributions of words, hashtags and geotags. A larger mini-batch size may not track contextual change in words and hashtags effectively, and a smaller mini-batch size may lead to unnecessary processing of the location specific distribution. An increase in mini-batch size decreases the convergence rate and reduces the communication cost (Li et al. 2014). Hence, both small and large mini-batch sizes can be optimally initialized and varied based on the distributions in previous snapshots for learning the change in the posterior distribution. Mini-batch sizes for words (m_w^{t+1}), hashtags (m_h^{t+1}) and geotags (m_g^{t+1}), which replace 'm' in equation (10), are computed from the distributions in the previous snapshots. Dynamic topic model parameters for the distributions of words and tags cannot be integrated and must be computed separately: (i) sampling parameter for the mean at time 't', \mu_t \sim \mathcal{N}(\mu \mid \mu_{t-1}, \sigma^2); (ii) sampling parameter for the topic-document proportion at time 't', \theta_t \sim \mathcal{N}(\theta \mid \mu_t, \delta^2).
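The paper's exact formulas for m_w^{t+1}, m_h^{t+1} and m_g^{t+1} are not reproduced here, so the sketch below shows only one plausible heuristic in that spirit: shrinking the mini-batch when the snapshot-to-snapshot change in a distribution is large and growing it when the distribution is stable. The function name, bounds and scaling rule are all hypothetical.

```python
import numpy as np
from scipy.stats import entropy

def next_minibatch_size(m_prev, dist_prev, dist_curr, m_min=50, m_max=500):
    """Hypothetical update for a per-attribute mini-batch size (words,
    hashtags or geotags), driven by the KL divergence between the
    attribute's distributions in consecutive snapshots."""
    change = entropy(dist_curr, dist_prev)   # KL(current || previous)
    scale = 1.0 / (1.0 + change)             # more change -> smaller batch
    return int(np.clip(m_prev * (0.5 + scale), m_min, m_max))
```

Applied separately to words, hashtags and geotags, a rule of this shape keeps the fast-moving hashtag distribution on small batches while letting the slower word and geotag distributions use larger ones.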

Kullback-Leibler Divergence of Topic
Topic drift can be described by the Kullback-Leibler (KL) divergence (Cai et al. 2014) of the distribution of a topic 'z' over words 'w', conditioned on hashtag 'y' and geotag 'x', over time as

D_{KL}(p_t \,\|\, p_{t+1}) = \sum_{w} p(w \mid z, y, x, t)\, \log \frac{p(w \mid z, y, x, t)}{p(w \mid z, y, x, t+1)}

If the KL divergence of a topic is high during a particular period, it denotes a drift in the topic during that interval.
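A short sketch of how this drift signal could be computed over a topic's per-snapshot word distributions; the smoothing constant and drift threshold are assumptions.

```python
import numpy as np
from scipy.stats import entropy

def topic_drift_kl(topic_word_dists, threshold=0.5):
    """topic_word_dists: list of per-snapshot word distributions for one
    topic (each a vector over the vocabulary, conditioned on hashtag and
    geotag context). Returns the KL series and the flagged intervals."""
    eps = 1e-12  # smoothing to avoid log(0) for words absent in a snapshot
    kl = [entropy(np.asarray(p) + eps, np.asarray(q) + eps)
          for p, q in zip(topic_word_dists[:-1], topic_word_dists[1:])]
    drift_intervals = [t for t, d in enumerate(kl) if d > threshold]
    return kl, drift_intervals
```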

Topical Word Embedding with Hashtag and Geotag Contexts
The WE model predicts the context words for a given word, whereas TWE predicts the context words for a given word and topic. The proposed TWE model, however, predicts the context words for a given (word, topic) pair conditioned on hashtag and geotag contexts (Equation 25). Both the hashtag context vector (HCV) and the geotag context vector (GCV) can be jointly learned as performed by Niu et al.
The joint learning of topical embedding using both hashtag and geotag context vectors is described below (Figure 6).

Figure 6. Joint Learning of Hashtag and Geotag Contexts
Once the model is trained with the two sets of word_topic pairs, [(<wc1_zh>_t, <wc2_zh>_t, ...)] from the hashtag context and the corresponding pairs from the geotag context, the probability of a context word is computed with the softmax function, described (Liu et al. 2015) as

p(c \mid w, z) = \frac{\exp(\mathbf{v}_c \cdot \mathbf{v}_{w,z})}{\sum_{c' \in V} \exp(\mathbf{v}_{c'} \cdot \mathbf{v}_{w,z})}

where \mathbf{v}_{w,z} is the embedding of the (word, topic) pair and V is the vocabulary of context words.
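A minimal sketch of this prediction step; combining the word, topic, hashtag-context and geotag-context vectors by summation is an assumption for illustration, since the paper's exact composition (Equation 25) is not reproduced here.

```python
import numpy as np

def context_softmax(w_vec, z_vec, hcv, gcv, context_matrix):
    """Probability over candidate context words for a (word, topic) pair,
    conditioned on the hashtag context vector (HCV) and the geotag context
    vector (GCV). context_matrix: |V| x d output embeddings."""
    query = w_vec + z_vec + hcv + gcv   # assumed combination of the four vectors
    scores = context_matrix @ query     # dot product with every context word
    scores -= scores.max()              # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```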

Topic Drift
Topics at a given period may evolve from the topics at the previous time interval.

Dataset
Chennai

Topic Model - Results
Topics with posterior probabilities of words, hashtags and geotags are generated using the proposed LDA_Tags model.

Evaluation
The evaluation is performed for the topic model, the topic dynamic model, topical word embedding and topic drift. For topic models, the common parameter used for evaluation is perplexity (Blei et al. 2003).

Topic Model Evaluation
Based on the perplexity computation by Blei et al. (2003), the perplexity for LDA_hash, LDA_geo and LDA_Tags is computed as

\text{perplexity}(D_{test}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)

The perplexity of the proposed LDA_Tags decreases as the number of topics increases from 10 to 50. Comparison between LDA (only word distributions), LDA_hash (word and hashtag distributions), LDA_geo (word and geotag distributions) and LDA_Tags (word, hashtag and geotag distributions) with Chennai data shows that the lowest perplexity is achieved with LDA_Tags (Figure 7). However, this does not give details about the statistical significance.
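Following the Blei et al. (2003) definition above, here is a small helper showing how perplexity could be computed from held-out per-document log-likelihoods; the input format is an assumption.

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """Perplexity = exp(-(sum of test log-likelihoods) / (total word count)).
    log_likelihoods: per-document log p(w_d); doc_lengths: N_d per document."""
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths)))
```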

Figure 7. Perplexity vs. Topics
To find the significance of the topic model with tags context, a one-way ANOVA analysis (Table 5) with Chennai data is done for the perplexity parameter over time. Since the baseline LDA does not include tags, it is not considered for the ANOVA analysis. The null hypothesis assumed here is "The mean of perplexities over three intervals is the same for all topic models". The perplexity means of the three topic models (Table 5) over three periods are statistically different between groups, with the minimum value (143.2254) for LDA_Tags. It is also found that the F-measure between groups (Table 6) is 46.0966 and the p-value is less than 0.05 (0.0002). Hence, the null hypothesis is rejected. P-values from the within-groups details clearly show that LDA_Tags is significant compared to the others and gives improved performance for the tags based context. Log posterior estimates of the probability distributions of words, hashtags and geolocations vary with time (Figures 8a, 8b), which confirms their dynamic nature and the need for varying mini-batch sizes. There is more variation in the distribution of words compared to hashtags and geolocations in Chennai data (Figure 8a). However, hashtags have more dynamic variation in distribution compared to the nominal change in words and geolocations in UK data (Figure 8b). This may be due to the abundant and frequent usage of social media by Twitter users in the UK compared to Chennai.

Dynamic Model Evaluation
The relative perplexity of recent with respect to previous intervals (Knights et al. 2009) is calculated for comparison. Performance of the dynamic topic model with varying mini-batch sizes (words - m_w, hashtags - m_h, geolocations - m_g) is compared with that of random and fixed (words, hashtags, geolocations - 100) mini-batch sizes. The average perplexity varies at nearly the same rate, and lower values are obtained for SGLD_vm compared to those of SGLD_fm and SGLD_rm (Figure 9a). However, the relative perplexity is high for SGLD_vm with Chennai data, and there is no distinction between the dynamic models with UK data, which is due to the smaller interval between snapshots.
Hence, SGLD_vm gives better performance compared to other models.

Topical Word Embedding Evaluation
Evaluation of TWE is achieved by visualizing the embedding vectors of the top 5 topics using t-SNE (Figure 10). The plots of TWEs without joint learning, using topics from LDA (Figure 10a), LDA_hash (Figure 10b) and LDA_geo (Figure 10c), and with joint learning (LDA_Tags) show that better grouping of topics is achieved with LDA_Tags (Figure 10d). In UK data, the topic 'job' shows very high divergence on 19-4-16 and 20-4-16 (Figure 11b), showing the impact of drift on 'job'. This is due to the association of 'job' with the hashtag '#o2job'.

Topic Drift Evaluation
The mean squared deviation of the topical embedding vectors of ten topics over time shows that topics have less deviation during May compared to January and August (Figure 12a). However, the MSD of the topic 'jallikattu' is high during January (Figure 12a), which correctly predicts the duration of the protest, in contrast to the low KL divergence value (Figure 11a). KL divergence also wrongly detects drift in 'gonna' and 'actor'. With UK data, the topics 'job', 'wind', 'weather' and 'books' have a left side skew and 'london' and 'sales' have a right side skew, indicating the possibility of drift (Figure 12a). Among these six topics, 'sales' has more variation, denoting drift in 'sales' during the interval. This is due to the co-occurrence of 'sales' with '#internet'. The proposed topical embedding model correctly predicts drift in 'sales', whereas KL divergence shows no deviation for the topic 'sales' (Figure 12b). To confirm the results, TWEs of Chennai data (January) have been examined for the topical variation of 'jallikattu'. Visualization using t-SNE (Figure 13) clearly shows the grouping of 'marina' with the topic 'jallikattu' due to the trends in the Twitter data during the protest.

Figure 13. Topical Word Embedding -t-SNE
This ensures that the proposed model detects drift in topics more accurately compared to other models based on topic distribution over words and hashtags. This is possible with the joint learning of hashtag and geotag contexts. The topic drift detection accuracy of models based on contexts of only words (LDA), words+hashtags (LDA_hash), words+geotags (LDA_geo) and words+hashtags+geotags (LDA_Tags) has been compared for the top 100 topics with different mini-batch size options (Figure 14).

Figure 14. Accuracy
It is found that higher accuracy (68% for Chennai data and 63.2% for UK data) is achieved with the proposed jointly learned TWE model with hashtags and geotags and with varying mini-batch sizes, compared with the other context models.

CONCLUSION
The proposed topical embedding model based on tags context represents topics and words in the same semantic space effectively. The dynamic topic model, SGLD with varying mini-batch sizes, performs well in computing the dynamic topic parameters, with lower average perplexity and higher relative perplexity for identifying topic drift in social media text with tags as topic indicators. Contextual changes in the topic distribution with hashtags and geolocations are better detected by the proposed topic model with tags contexts. MSD of topical embedding vectors detects topic drift more exactly than KL divergence of topic distribution. The work can be extended in the future by adding different time-level discretization for data, i.e., coarse level for words and geotags, and fine level for hashtags. Individual hashtags can be replaced by hashtag groups to accommodate more hashtags in the topical embedding model. Geolocations may also be grouped to study topic change due to community groups. Other cultural factors and events can also be related to the location based impact on the drift.

Conflict of interest:
All authors declare that they have no conflict of interest.