Detection of Political Intent through Analysis of Tweets and Homophily Elements

Twitter is one of the most popular social networking platforms of today's generation and a fundamental source for harvesting the data of users worldwide. Discussions ranging from current affairs, news sharing and filing complaints to advertising and common interests happen on Twitter. It is widely used by many famous politicians as a prominent communication medium to address large masses, owing to its mass usage and popularity. Thus, it is safe to assume that people communicate their political ideologies on Twitter. Many works so far have deduced the stance of a user towards a particular party by performing sentiment analysis on their tweets using popular classifiers. To find connected and similar users, earlier works generated a social network graph based on the assumption that friends and followers share similar interests, which might not be true in all cases. In contrast, the proposed work employs an ensemble classifier (a single classifier generated from several base learning classifiers) to analyze the tweets and makes use of multiple interaction elements such as followers/following, mentions, retweets and hashtags to infer which political party a user identifies with. These interaction elements bring out the homophily (users sharing the same beliefs and choices) amongst the users. The proposed study can be extended to any domain and can be used by advertising agencies, marketing companies, e-commerce, health care etc. to identify their target audience.
… and period-length of interactions) … the emoticon datasets … Twitter dialects). Inferring non-English words (mostly in context), URLs, images and neutral tweets could be a new domain of research. Noteworthy domains where this research can be applied are e-commerce, marketing analysis, online …


Introduction
Social networks have become a crucial medium of information and opinion exchange for today's generation. One such popular site is Twitter, with more than 1.45 billion registered users and 330 million monthly active users posting approximately 500 million tweets per day [1]. Twitter is the most common micro-blogging platform in countries like the United States, Japan and India. Its popularity has encouraged researchers to analyze its contents and detect hidden patterns.
In recent years, Twitter has become a popular campaigning tool and medium of communication between leaders and voters. Many prominent political leaders are very active on Twitter, and the abundance of political discussions and opinions makes Twitter a credible platform for predicting a person's emotions and political affinity.
Data present in an Online Social Network (OSN) can be modelled as G = (V, E), where V represents the actors (people, groups etc.) and E represents the ties or relationships between the nodes [2]. Fig. 1 gives a glimpse of a social network in graphical form.
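As a minimal sketch of the G = (V, E) model, the snippet below stores actors as a vertex set and ties as labelled, directed edges. The function name `build_graph`, the triple format and the sample users are assumptions made for illustration, not part of the study.

```python
from collections import defaultdict

def build_graph(edges):
    """Build G = (V, E) from (source, target, relation) triples."""
    vertices = set()
    adjacency = defaultdict(set)
    for source, target, relation in edges:
        # Every endpoint of a tie is an actor in V.
        vertices.update((source, target))
        # E is kept as labelled, directed edges per source node.
        adjacency[source].add((target, relation))
    return vertices, adjacency

# Hypothetical users and ties.
vertices, adjacency = build_graph([
    ("alice", "bob", "follows"),
    ("bob", "carol", "mentions"),
])
```

Labelling each edge with its relation type keeps follow, mention and retweet ties distinguishable within one graph.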
Data in Online Social Networks [3] [4] conveys the following:
• Profiles: Give the portrayal of the user/alter ego to the outside world.
• Connections: Showcase the relationships between users, such as friendships, follow, following etc.
• Messages/Tweets/Hashtags: Include the interactions amongst users/groups.
• Multimedia: Collective terminology for audio, video and pictures exchanged or uploaded by the users.
• Tags/Mentions: Include the metadata linked to a particular content.
• Preferences: Consist of likes and causes supported, which can be explicitly stated or hidden.
• Groups: Collections of users sharing similar interests, background etc.
• Behavioral information: Can be inferred from user activities on such sites.
• Login credentials: Ensure that the use is restricted to authorized users.
Machine learning classifiers are employed to determine the attitude of a user towards a particular group (positive/negative). The rest of the paper is organized as follows. Section 2 touches upon the prominent and latest research in the constituent parts of this study. A detailed description of the proposed methodology is presented in Section 3. Section 4 presents the obtained results with suitable error analysis. The paper is concluded with some discussion of possible future work in Section 5.

Related Works
The proposed work can be partitioned into 3 major domains on which many prominent works have been done. Some of the recent works referred to are listed below.
Sentiment analysis [5], a popular natural language research area, is extensively used in many domains such as brand exploration, vote share prediction, current affairs news analysis etc. User-generated content on such online platforms is utilized by many advertising companies to market their products, thereby improving their sales (through feedback). Generally, machine learning classifiers [6] like Naïve Bayes have shown better accuracy when compared to the lexicon-based approach [7].
The two most common psychological traits observed in social media are homophily and social media influence. Homophily is the tendency of users to interact with like-minded people having similar likes. Social media influence refers to the fact that people's attitudes are affected by the peers in their social circle [8].
Along with the tweets, our research analyses the communication elements of a user, such as follow/following, mentions, retweets and hashtags, to gather similar users together.
The proposed methodology is a two-fold process wherein the first part analyzes the tweets of the user using an ensemble [9] and a deep learning classifier [10] to predict the political party of the user. This is supplemented with the analysis of tweets and sarcastic contents, if present. In the second part, similar users are grouped together using various interaction elements (retweets, mentions, following/followers, hashtags etc.) [11] [12].
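To illustrate the idea of an ensemble as a single classifier built from several base learners, here is a minimal hard-voting sketch. The three rule-based base classifiers and their keyword rules are toy stand-ins invented for this example; they are not the trained classifiers used in the study.

```python
from collections import Counter

# Toy base learners (assumptions for illustration only).
def by_hashtag(tweet):
    return "Republican" if "#maga" in tweet.lower() else "Democratic"

def by_mention(tweet):
    return "Republican" if "@realdonaldtrump" in tweet.lower() else "Democratic"

def by_keyword(tweet):
    return "Democratic" if "democrats" in tweet.lower() else "Republican"

def ensemble_predict(classifiers, tweet):
    """Hard voting: return the label predicted by most base classifiers."""
    votes = Counter(classify(tweet) for classify in classifiers)
    return votes.most_common(1)[0][0]

BASE_CLASSIFIERS = [by_hashtag, by_mention, by_keyword]
prediction = ensemble_predict(BASE_CLASSIFIERS, "Proud to stand with @realDonaldTrump #MAGA")
```

Using an odd number of base learners on a two-class problem avoids tied votes.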

Sentiment analysis & Prediction of chosen political party
Any sentiment analysis approach falls into one of the following 2 broad categories:
• Lexicon/corpus-based approach: Sentiments are analyzed by referring to a popular dictionary to detect the polarity of the opinion word.
• Machine learning-based approach: Prominent classifiers are trained and subsequently tested to determine the polarity of the tweet.
Sharma et al. [13] (2018) analyzed the stance of people towards 2 popular parties, Conservative and Labour. k-NN offered the best accuracy for their dataset, and the authors observed a rise in the popularity of the Labour party and a decline in that of the Conservative party.
Widodo Budiharto and Meiliana Meiliana [14] (2018) applied sentiment analysis to predict the most popular political leader among Indonesians. The polarity of a tweet was decided based on the total number of positive and negative words about a particular leader. This was augmented with the likes and retweets favoring a leader. We have improved upon this work by adding more user signals in our research.
Alaoui et al. [15] (2018) considered the semantic meaning of the tweets and developed their own labelled corpus using the prominent hashtags used per party. Elongated words were given different priority scores, and the influence of a tweet was captured by incorporating retweets and likes (more retweets/likes indicate a stronger preference towards that party). The comparison of the proposed algorithm against the Google Cloud Prediction API and Naïve Bayes showed favorable results. However, as noted by the authors themselves, increasing their sample size could substantially improve their work. An interesting application of deep learning to personality detection was observed in [16] [2], which emphasized processing the semantic information in user-generated texts. Xue et al. (2018) examined the linguistic statistics of the tweets (number of punctuation marks, capitals, text frequency, etc.) along with inferring the psychological aspects of the complete tweet set per user to detect the major emotion projected. The proposed CNN-based detection system improved on the prediction accuracy of other regression models.
Nagarajan et al. [17] (2018) improved the classification accuracy of the decision tree by selecting an apt feature set using particle swarm optimization and a genetic algorithm. They were successful in classifying emotions into multiple classes, and their hybrid technique offered better detection than other base classifiers. The authors hope to extend their work so as to improve the optimization.
Abhishek Kumar et al. [18] (2019) incorporated a bidirectional LSTM which studies the context of a word in a sentence and ultimately identifies the sentence polarity. This two-level approach first studied the word meanings through a Distributional Thesaurus and then applied them to the sentence to detect opinions and emotions. However, the error analysis of their approach showed that in compound sentences, many differing emotions tend to overwhelm the classifier, often leading to misclassification.
Wang et al. [19] (2019) surveyed various techniques for stance detection. They categorized their study into 2 main domains: one with textual content only and the other where textual content was supplemented with metadata. They also studied various works on feature extraction and opinion mining at various levels (corpus/document). They further highlighted the challenges faced in online opinion mining and offered some possible solutions. They indicated the use of machine learning models which work on base data to increase the sample size. A move towards target-specific attention-based networks for research was suggested.
Hima Suresh et al. [20] (2019) put forth an inventive approach for sentiment analysis by modifying the C4.5 decision tree, which outperformed all the other classifiers by a significant margin. Ankita Sharma and Udayan Ghose [21] (2020) analyzed Twitter data in the context of the general elections in India. In their study, R packages were used to collect and pre-process the data. R and RapidMiner's Aylien were used to check the sentiment tallies (for or against), polarity and subjectivity of the tweets. Since the study was limited to the exploration of tweets, it can be extended to use multimedia, hashtags etc. A striking limitation observed was that the location was not generalized while collecting tweets, causing bias in the study.
Mohd Zeeshan Ansari et al. [22] (2020) analyzed the political scenario in India via tweet analysis. Although the majority of the classifiers (LSTM, Decision Tree, Random Forest, SVM and Logistic Regression) used in their work match ours, labelling with human annotators did not give them a reasonable corpus. Future work in this domain would use semi-supervised classifiers to incorporate a richer vocabulary (one that includes more jargon and slang) so as to improve the training and analysis of the classifiers.

Stance & Homophily Detection
Rossetti et al. [23] (2016) investigated the relationship between the topological features of various networks and the degree of homophily between their subscribers. Skype, Last.fm and Google+ users were inspected to determine similarity in usage pattern, listening activity and education, respectively. Users were grouped using various community detection algorithms, and new users were assigned to appropriate groups based on their chronological, geographical and topological features. Like us, they further plan to explore the strength of bonds between users to support better prediction.
Darwish et al. [24] (2017) analyzed interaction elements to detect homophily. The proposed similarity formula was able to group together people sharing the same stance. Their study yielded good results for retweets/hashtags but not for a combination of interaction elements. The authors speculated that the reason was improper analysis of URLs.
Du et al. [25] (2017) proposed a novel neural network-based stance detection algorithm called the target-specific attention neural network, which incorporates the following:
• An RNN to extract object-specific data.
• An LSTM and an attention mechanism to focus on the prominent parts of the text with respect to the object and use this information to detect stance.
The authors further aim to find a suitable way to add external knowledge to improve the accuracy. They also plan to combine their approach with other machine learning algorithms.
Rajendran et al. (2018) [26] compared the accuracies of the Gated Recurrent Unit (GRU) and the Bidirectional Long Short-Term Memory (LSTM) in detecting the stance of the user on a news data set crawled from popular news media. The bidirectional LSTM offered better detection, and we incorporated the parameters they used in our work.
Poddar et al. [27] (2018) employed CNN- and RNN-based encoders to determine the hidden stances in users' tweets. Many users' stances on a particular topic were then aggregated to check whether it is a rumor or not. This model, when combined with authenticity detection using transfer learning, offered better results than its contemporaries. Barone and Coscia [28] (2018) investigated tax fraud in business partnerships by considering users as nodes and versions about them as edges. They gave a new take on trust computation and analyzed how the trust score and similarities amongst groups contribute to homophily.
Mirko Lai et al. [29] (2019) investigated the opinion of users on the reforms introduced in the Italian Constitution by analyzing the tweets and network elements such as retweets, mentions, reply-to and follow/following relationships. Our research reached the same conclusion as theirs, namely that retweets, mentions and follow/following can exhibit homophily. They established that difference of opinion could be expressed using reply-to, which can be incorporated in our future analysis. In future, the authors aim to study the influence of a prominent person/bot in changing the perception of his peers.
Sailunaz and Alhajj [30] (2019) generated their own data set in which tweets and replies were analyzed for the sentiments and emotions expressed. The user's influence was further determined using the number of followers, retweets, likes etc. The authors could generate a customized recommendation system with their approach. However, their experiments focused on simple texts tailor-made for their approach, and it needs to be improved to deal with the random abbreviations, emoji, hashtags etc. prominent on any open platform.
Abeer Aldayel and Walid Magdy [31] (2019) categorized Twitter data into likes, interactions (mentions, retweets and replies), connections (follow) and textual contents. The stance was analyzed using a linear SVM, which offered good results. One of the major shortcomings of this method was that it treated the presence of a retweet, reply or mention as support, which need not always hold true. The proposed classifier often misclassified the 'none' stance into either of the classes owing to a biased and constrained dataset.
Hamdi et al. [32] (2019) employed node2vec to extract important features from the user follower/following graph and combined them with user features extracted via Twitter APIs to detect fake news on Twitter. The extracted features were used as a metric to vouch for user credibility. In future, the authors plan to extend the proposed methodology to detect rumors, junk etc., while capitalizing on the opportunities in domains such as recommendation systems.
Darwish et al. [33] (2020) presented an unsupervised learning algorithm to categorize the stance of users. Preprocessed tweets underwent dimensionality reduction to reduce the effect of outliers prior to clustering via DBSCAN and mean shift. The labeling time of the stances was significantly reduced when compared to supervised learning approaches. However, their method was limited to tweets and can be extended to incorporate other metadata/multimedia.

Emoji & Sarcasm Analysis
Fede et al. [34] (2017) scrutinized emoji usage statistics on various social networking platforms and grouped them into 3 categories: page rank, popularity and simultaneously used emoticons. The authors were able to deduce sarcasm from consecutive conflicting emoticons (a positive emoji followed by a negative one or vice versa). They could model the most common emoji used per subject (area of interest). They further plan to explore the semantic information conveyed if emojis are used as a language.
Chen et al. [35] (2018) explored the usage of a particular emoji in both positive and negative contexts. Their proposed attention-based LSTM incorporates this bi-sense embedding scheme to predict sentiments better than the prevailing techniques. In future, the authors intend to adopt multi-sense embedding too.
Subramanian et al. [36] (2019) worked on the research gap of ambiguities experienced in sarcasm detection owing to the short texts in social media. In their approach, sarcasm detection was done via the texts and emoji used. Our work is inspired by the methodology they followed to infer sarcasm from emoticons.
Li et al. [37] (2019) improved the sentiment analysis of Weibo tweets by integrating in-depth emoji analysis with an attention-based GRU network which focuses on the key words in a sentence. The emoji were classified into positive, negative and humorous. Misclassification occurred due to ambiguous usage of some emoticons and a limited data set. The authors plan to incorporate machine learning approaches/image processing to improve the prediction capability of their classifier. Cai et al. [38] (2019) expanded sarcasm detection to include image and textual attributes. An LSTM was used to infer apt textual features, and the model developed by the authors extracted image traits to decide whether the tweet is sarcastic or not. The model failed to give an appropriate classification when the attributes fetched from the text and images were contradictory. The authors hope to improve their work by including background knowledge and to extend it to include audio-visual content.
Potamias et al. [39] (2020) engineered a transformer-based network architecture in order to eliminate the huge computation costs involved in data preprocessing, as observed with the conventional methods. Their RoBERTa architecture, a hybrid deep learning architecture involving both CNN and RNN (used to grasp circumstantial and temporal information), proved effective by a large margin in the detection of figurative language when compared to other state-of-the-art techniques.

Figure 3 Data Collection
Khotijah et al. [40] (2020) applied Paragraph2vec to find the context in the tweets and then classified whether the tweets were sarcastic or not using an LSTM. Their method worked well for Indonesian tweets but was found deficient for English tweets owing to a sparse dataset. Improvement can be made by scrutinizing the expressed emotions in various contexts to improve the detection rate.
Sundararajan et al. [41] (2020) studied how a user's mood change affects the level (polite, rude, rampant and deadpan) of sarcasm expressed. The feature set for classification was gathered using ensemble classifiers, and the strength of sarcasm was assessed with respect to the positive/negative word count. The intensifiers used were evaluated using a fuzzy logic-based approach. The tweets of a user over a time period were scrutinized to characterize his mood variations. The only limitation we observed is that the tweets ranged over a wide span of topics and were not contextual, unlike in our study.

Methodology
The proposed methodology incorporates the following steps, as shown in Fig. 2:
• Data was collected from prominent Twitter handles, and an existing data set from Kaggle [42] was also used.
• Tweets were analyzed to find the opinions of the users towards the 'democratic' or 'republican' groups.
• Sarcasm detection using a convolutional neural network was done to enhance the sentiment analysis procedure.
• Interaction elements of the users (friendships, retweets) and the textual traits (hashtags and mentions) were assessed to predict their political affinity.
The following subsections explain the above-mentioned steps in detail.

Data Collection
An ample list of prominent party handles, active party leaders' Twitter handles, election campaign handles, common hashtags and search keywords of the parties was prepared. Geotagging was applied to set the location to the United States. The Kaggle data set was used as a secondary source to supplement our data set. Some of the popular keywords and handles used were 'Democrats', '…Cinton', '@realDonaldTrump' and '@HillaryClinton'. Further, using some seed users, secondary user profiles were fetched from their network, as shown in Fig. 3 [8].

Sentiment Analysis
The process is illustrated in Fig. 4 and explained in Algorithm 2.

Pre-processing [50] [51]
• Words were reduced to their root words.
• The WordNet corpus was used to lemmatize each word to its root word (e.g. best is equivalent to good).
• Using the Tokenizer library, each sentence was split into its constituent parts to obtain the opinion words (usually the adjectives/adverbs). Negation words such as "not" were also identified.
• Vectorization using term frequency-inverse document frequency (TF-IDF) was done to obtain the rare and frequent words used with respect to a party [52] [53].
• The bag of words of each of the parties was extracted.
• The labels of each party were converted to integer format to aid the classification process.
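The tokenization, TF-IDF weighting and label-encoding steps above can be sketched in plain Python as follows. This is an illustrative approximation only: the stop-word list, the regular expressions and the function names are assumptions, WordNet lemmatization is omitted, and the weighting used is the basic tf × log(N/df) form rather than whatever variant the study's tooling applied.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}

def tokenize(tweet):
    """Lower-case a tweet, drop URLs, and split it into word tokens."""
    text = re.sub(r"https?://\S+", " ", tweet.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [token for token in tokens if token not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def encode_labels(labels):
    """Map party names to integers, assigned in sorted order."""
    mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping

weights = tf_idf(["Vote blue today", "Vote red today http://x.co"])
encoded, mapping = encode_labels(["republican", "democratic", "republican"])
```

Terms that occur in every document (here "vote" and "today") receive a weight of zero, while terms specific to one document keep a positive weight, which is what lets TF-IDF separate party-specific vocabulary from common words.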

Applying sentiment analysis to the tweets.
The following classifiers are used for the process.

Sarcasm detection
To classify the tweets, a Convolutional Neural Network [61] was used, as shown in Fig. 5.

Sarcasm Analysis
To enhance the accuracy of sentiment analysis, sarcasm analysis is also incorporated. Sarcasm is an ironic way to convey contempt and is often misleading (its interpretation is subjective). Normally, sarcasm is used to appear funny, show angst or evade giving a straight answer. Our model labels sarcastic tweets towards "Democratic" as a preference towards "Republican" and vice versa. Algorithm 5 provides an insight into the procedure followed.
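The label-flipping rule described above can be sketched as a small helper. The function and constant names are assumptions for illustration; the sarcasm flag is taken as given here, whereas in the study it comes from the CNN-based detector.

```python
# Each party label maps to its opposite.
OPPOSITE = {"Democratic": "Republican", "Republican": "Democratic"}

def adjust_for_sarcasm(predicted_party, is_sarcastic):
    """Flip the predicted party when the tweet is flagged as sarcastic."""
    if is_sarcastic:
        return OPPOSITE[predicted_party]
    return predicted_party

adjusted = adjust_for_sarcasm("Democratic", is_sarcastic=True)
```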

Data Collection and Pre-Processing
To collect sarcastic tweets, the API was queried to return tweets marked '#sarcastic', which serves as a search indicator, since sarcastic tweets are subjective to a particular person and often elusive. The data was cleaned to remove noise following the steps in Algorithm 1. Finally, the hashtag indicating sarcasm was removed and the data set was manually annotated (sarcastic tweets getting a value of 1 and non-sarcastic tweets a value of 0).
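A minimal sketch of the last two steps, removing the search hashtag and attaching the manual annotation, is given below. The regular expression and the record layout are assumptions; the labels are supplied by a human annotator rather than derived from the hashtag.

```python
import re

# Matches the search hashtag in any letter case.
SARCASM_TAG = re.compile(r"#sarcastic\b", re.IGNORECASE)

def strip_sarcasm_tag(tweet):
    """Remove the '#sarcastic' marker and collapse leftover whitespace."""
    return " ".join(SARCASM_TAG.sub("", tweet).split())

def build_dataset(tweets, manual_labels):
    """Pair each cleaned tweet with its annotation (1 = sarcastic, 0 = not)."""
    return [
        {"text": strip_sarcasm_tag(tweet), "label": label}
        for tweet, label in zip(tweets, manual_labels)
    ]

dataset = build_dataset(
    ["Great job, again #Sarcastic", "Polls open at 8 am"],
    [1, 0],
)
```

Removing the hashtag before training prevents a classifier from simply learning the marker instead of the wording of the tweet.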

Homophily detection
Unlike direct tweets, here the contacts and the communication elements of the users were examined to deduce the chosen political party from similar users. The connections analyzed in our work are described below [62].
• Following/Follower: It is assumed that people take an interest in the accounts of personalities with whom they share a certain similarity or whom they like. The idea here is to accumulate the follower/following group of a particular user so as to infer his political ideology from them. We checked whether the majority of a particular user's friends follow "Democrats" or "Republicans" and inferred accordingly.
• Mention: The mention similarity checked how often a particular party's handle or a prominent leader's profile is mentioned by a particular user using '@' in a positive sense. It also analyses how often a particular person's handle is mentioned in the tweets of the user, to assess their friendship (further inference of the user's political choice is done using that of his friend).
• Retweet: Users retweeting a particular person's tweet show their interest, which can be used as a way to find similarity.
• Hashtag similarity: The use of common hashtags trending with respect to a particular party was scrutinized to infer the sentiments of a particular user towards them.
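The four interaction signals described above can each be read as a majority vote over items that are associated with a party. The sketch below follows that reading; the lookup tables, the function names and the choice to return no result on a tie are assumptions, and the sentiment check on mentions ("in a positive sense") is deliberately left out.

```python
from collections import Counter

# Hypothetical lookup tables for illustration.
LABEL_PARTY = {"democratic": "Democratic", "republican": "Republican"}
HANDLE_PARTY = {"@hillaryclinton": "Democratic", "@realdonaldtrump": "Republican"}
HASHTAG_PARTY = {"#imwithher": "Democratic", "#maga": "Republican"}

def vote(items, party_of):
    """Map items to parties and return the majority party.

    Items with no known party are ignored; None is returned on a tie
    or when nothing could be mapped.
    """
    counts = Counter(party_of.get(item.lower()) for item in items)
    counts.pop(None, None)
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def interaction_signals(friend_parties, mentions, retweeted_authors, hashtags):
    """Compute one inferred party per interaction element."""
    return {
        "follow": vote(friend_parties, LABEL_PARTY),
        "mention": vote(mentions, HANDLE_PARTY),
        "retweet": vote(retweeted_authors, HANDLE_PARTY),
        "hashtag": vote(hashtags, HASHTAG_PARTY),
    }

signals = interaction_signals(
    friend_parties=["Democratic", "Democratic", "Republican"],
    mentions=["@realDonaldTrump", "@HillaryClinton"],
    retweeted_authors=["@HillaryClinton", "@HillaryClinton", "@realDonaldTrump"],
    hashtags=["#MAGA", "#mondaymotivation"],
)
```

Keeping the signals separate, rather than pooling all items into one vote, lets a later step weigh or combine them explicitly.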

Classification
Using the above signals, 2 groups per signal are generated, named 'Trump1'/'Hillary1', 'Trump2'/'Hillary2', 'Trump3'/'Hillary3' and 'Trump4'/'Hillary4'. A class was calculated for each signal and the final class was said … where … rose to 97%. Here, every tweet pertaining to politics was analysed and the count of each inferred class (Democratic/Republican) was calculated. The final class was assigned to the one having the higher class count.
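The counting rule above can be sketched as follows, taking the per-signal group names as input. The helper names and the handling of ties (returning no class) are assumptions, since the text does not say how ties are resolved.

```python
from collections import Counter

# Group prefixes map to the two classes used in the study.
GROUP_TO_PARTY = {"Trump": "Republican", "Hillary": "Democratic"}

def party_of_group(group_name):
    """Strip the trailing signal index: 'Trump3' becomes 'Republican'."""
    return GROUP_TO_PARTY[group_name.rstrip("0123456789")]

def final_class(signal_groups):
    """Return (class with the higher count or None on a tie, class counts)."""
    counts = Counter(party_of_group(group) for group in signal_groups)
    ranked = counts.most_common(2)
    if not ranked:
        return None, counts
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None, counts
    return ranked[0][0], counts

party, counts = final_class(["Trump1", "Hillary2", "Trump3", "Trump4"])
```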

Analysis of Metadata
The signals can be mathematically presented in the following manner [62].
Follow/Following:
$$S_1 = n + k \qquad (1)$$
where $n$ is the total number of followers of user $u_i$ and $k$ is the total number of users that $u_i$ is following.
Mention:
• The function NU takes a user id and a hashtag as input and returns the total number of neutral tweets posted on that hashtag by the user.
• $q$ is the total number of hashtags tweeted by both $u_i$ and $u_j$.
where the function noofTwtsreTweetd returns the total number of $u_j$'s tweets that $u_i$ retweeted ($u_j$ can be a friend or the handle of a political party or its leader).
Table 2 and Fig. 8 exhibit the results observed during individual signal computation.
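The quantities defined above (S1, NU, q and noofTwtsreTweetd) can be sketched as small counting helpers. The data layouts (tuples of tweet id and author, dictionaries with 'user', 'hashtags' and 'sentiment' keys) and the Python function names are assumptions chosen for illustration; only the definitions themselves come from the text.

```python
def follow_score(n_followers, n_following):
    """S1 = n + k from Eq. (1)."""
    return n_followers + n_following

def no_of_twts_retweeted(retweets_by_ui, uj):
    """Count distinct tweets of uj that ui retweeted.

    retweets_by_ui is a list of (tweet_id, original_author) pairs.
    """
    return len({tweet_id for tweet_id, author in retweets_by_ui if author == uj})

def common_hashtag_count(hashtags_ui, hashtags_uj):
    """q: number of hashtags tweeted by both ui and uj (case-insensitive)."""
    return len({h.lower() for h in hashtags_ui} & {h.lower() for h in hashtags_uj})

def neutral_tweet_count(tweets, user_id, hashtag):
    """NU: neutral tweets posted by user_id on the given hashtag."""
    tag = hashtag.lower()
    return sum(
        1
        for tweet in tweets
        if tweet["user"] == user_id
        and tweet["sentiment"] == "neutral"
        and tag in {h.lower() for h in tweet["hashtags"]}
    )

# Hypothetical sample data.
retweets = [(1, "@hillaryclinton"), (2, "@hillaryclinton"), (3, "@nasa"), (1, "@hillaryclinton")]
tweets = [
    {"user": "u1", "hashtags": ["#Vote"], "sentiment": "neutral"},
    {"user": "u1", "hashtags": ["#vote"], "sentiment": "positive"},
    {"user": "u2", "hashtags": ["#vote"], "sentiment": "neutral"},
]
```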
Results for prediction using combined signals are shown in Table 3 and Fig. 9.

Conclusion and Future Work
Twitter, a commonly used platform nowadays, is widely used by many subscribers to share their moments and their stand on various issues, ranging from current affairs to mental health, via texts, support (retweets, likes and mentions), multimedia, emoji and groups. Existing work emphasized only the exploration of the users' tweets to predict the party they support. However, this was found to be ineffective in situations where users do not explicitly state their stand towards a particular political party. A single parameter like follow/following is not a viable measure to cluster similar users together, as their stance on all topics need not be the same. Hence, to address the above-mentioned research gaps, the proposed methodology incorporated a dual process, wherein the first part analysed the prominent political tweets of a particular user so as to infer which political party the user prefers. This was supplemented by categorizing the sarcastic tweets and the emoji used. Additionally, using the user's network (follower/following, retweets, mentions and hashtags), similar users were clustered together, following which the user's stance was inferred from his peers. Although both methods gave a good accuracy score, they can be enhanced by including additional signals such as profile type similarity, likes, topography, geography etc., depending on the application. Content analysis of the bio can also assist in grouping people who are alike. A notable future work would be detecting the strength of user connec-