In recent years, a number of studies on sentiment analysis on Twitter have been carried out by various researchers.[1]
Go et al. (2009) proposed an approach to sentiment analysis of Twitter data based on distant supervision, in which the training data consisted of tweets containing emoticons, which served as noisy labels. They built models using Naive Bayes, Maximum Entropy (MaxEnt), and Support Vector Machines (SVM), classifying tweets as positive, negative, or neutral. They concluded that SVM outperformed the other models and that unigrams were the most effective features.
Pak and Paroubek (2010) proposed a model that classifies tweets as positive, neutral, or negative. They built a Twitter corpus by collecting tweets through the Twitter API and automatically annotating those that contained emoticons. Using that corpus, they developed a sentiment classifier based on the multinomial Naive Bayes method.
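As an illustration of this emoticon-based distant supervision, the sketch below trains unigram Naive Bayes and SVM classifiers on a handful of invented tweets, with emoticons used as noisy labels; the toy data and scikit-learn models are stand-ins, not the setups used in the original studies.

```python
# Minimal sketch of emoticon-based distant supervision, assuming invented toy
# tweets; the cited studies trained on large corpora collected via the Twitter API.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

raw_tweets = [
    "great game tonight :)",
    "so happy with the new phone :)",
    "worst service ever :(",
    "stuck in traffic again :(",
]

def noisy_label(tweet):
    """Use emoticons as noisy sentiment labels, as in distant supervision."""
    return "positive" if ":)" in tweet else "negative"

labels = [noisy_label(t) for t in raw_tweets]
# Strip the emoticons so the classifier cannot simply memorise the labels.
texts = [t.replace(":)", "").replace(":(", "") for t in raw_tweets]

vectorizer = CountVectorizer()            # unigram bag-of-words features
X = vectorizer.fit_transform(texts)

for model in (MultinomialNB(), LinearSVC()):
    model.fit(X, labels)
    print(type(model).__name__,
          model.predict(vectorizer.transform(["happy with the game"])))
```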
Barbosa et al. (2010) designed a two-step automatic sentiment analysis method for classifying tweets. They first classified tweets as objective or subjective, and in the second step classified the subjective tweets as positive or negative. The feature space included retweets, hashtags, links, punctuation, and exclamation marks, together with features such as the prior polarity of words and POS tags.
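A minimal sketch of the two-step scheme is given below, assuming invented toy tweets and using logistic regression over TF-IDF features as a placeholder for the classifiers and feature set actually used in that study.

```python
# Minimal sketch of two-step sentiment analysis: subjectivity detection first,
# then polarity on the subjective tweets. Toy data and LogisticRegression are
# placeholders, not the components of the original study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    ("the match starts at 9pm http://t.co/x", "objective", None),
    ("new phone model released today", "objective", None),
    ("i love this phone!!!", "subjective", "positive"),
    ("this airline is terrible...", "subjective", "negative"),
]

texts = [t for t, _, _ in tweets]
subj_labels = [s for _, s, _ in tweets]

# Step 1: objective vs subjective, trained on all tweets.
subjectivity_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
subjectivity_clf.fit(texts, subj_labels)

# Step 2: positive vs negative, trained only on the subjective tweets.
subj_texts = [t for t, s, _ in tweets if s == "subjective"]
polarity_labels = [p for _, s, p in tweets if s == "subjective"]
polarity_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
polarity_clf.fit(subj_texts, polarity_labels)

def classify(tweet):
    if subjectivity_clf.predict([tweet])[0] == "objective":
        return "objective"
    return polarity_clf.predict([tweet])[0]

print(classify("i really love this airline"))
```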
Bifet and Frank (2010) used Twitter streaming data provided by the Firehose API, which delivers all publicly visible messages from every user in real time. They concluded that, when used with an appropriate learning rate, the SGD-based model outperformed the other methods they evaluated.
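The following sketch illustrates stream-style learning with an SGD classifier, assuming a simulated stream of labelled mini-batches in place of the Firehose API; the tweets, labels, and learning rate are illustrative.

```python
# Minimal sketch of learning from a tweet stream with SGD, assuming a
# simulated stream of labelled mini-batches rather than the Firehose API.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # stateless, suits streaming
clf = SGDClassifier(learning_rate="constant", eta0=0.1)  # fixed learning rate

stream = [
    (["love this :)", "awful day :("], ["positive", "negative"]),
    (["great news :)", "so annoyed :("], ["positive", "negative"]),
]

classes = ["negative", "positive"]
for batch_texts, batch_labels in stream:
    X = vectorizer.transform(batch_texts)
    # partial_fit updates the model one mini-batch at a time.
    clf.partial_fit(X, batch_labels, classes=classes)

print(clf.predict(vectorizer.transform(["what a great match"])))
```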
Davidov et al. (2010) proposed an approach that uses user-defined hashtags in tweets as sentiment labels, combining punctuation, single words, n-grams, and patterns as distinct feature types into a single feature vector for sentiment classification. They built a feature vector for each example and assigned sentiment labels with the k-Nearest Neighbour method.
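The sketch below illustrates this hashtag-as-label idea with k-Nearest Neighbour classification over a combined feature vector; the tweets and hashtags are invented, and word and character n-grams stand in for the richer pattern and punctuation features of the original method.

```python
# Minimal sketch of hashtag-labelled tweets classified with k-Nearest Neighbours
# over a combined feature vector; data and feature types are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, make_pipeline

raw = [
    ("best day ever #happy", "happy"),
    ("feeling wonderful today #happy", "happy"),
    ("everything went wrong #sad", "sad"),
    ("missed the bus again #sad", "sad"),
]
# The hashtag acts as the sentiment label and is removed from the text.
texts = [t.replace("#happy", "").replace("#sad", "").strip() for t, _ in raw]
labels = [lab for _, lab in raw]

# Combine two feature types into a single feature vector.
features = FeatureUnion([
    ("words", CountVectorizer(ngram_range=(1, 2))),                      # word n-grams
    ("chars", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))),  # keeps punctuation
])
knn = make_pipeline(
    features,
    KNeighborsClassifier(n_neighbors=3, metric="cosine", algorithm="brute"),
)
knn.fit(texts, labels)
print(knn.predict(["such a wonderful day"]))
```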
Agarwal et al. (2011) developed a three-way approach for classifying sentiment into positive, negative, and neutral classes. They experimented with three models: a unigram model, a feature-based model, and a tree-kernel-based model. For the tree-kernel model, they represented each tweet as a tree. The feature-based model uses 100 features, whereas the unigram model uses over 10,000. They concluded that features combining the prior polarity of words with their part-of-speech (POS) tags are the most important and play a major role in the classification task. The tree-kernel-based model outperformed the other two models.
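The kind of feature identified as most important, the prior polarity of a word combined with its POS tag, can be sketched as below; the tiny lexicon and hand-assigned POS tags are placeholders for a real polarity lexicon and POS tagger.

```python
# Minimal sketch of "prior polarity combined with POS tag" features. The
# lexicon and POS assignments below are invented placeholders.
from collections import Counter

PRIOR_POLARITY = {"love": "positive", "great": "positive",
                  "hate": "negative", "terrible": "negative"}
POS_TAGS = {"love": "VB", "hate": "VB", "great": "JJ",
            "terrible": "JJ", "movie": "NN", "this": "DT", "i": "PRP"}

def polarity_pos_features(tweet):
    """Count (prior polarity, POS tag) combinations, e.g. 'positive_JJ'."""
    feats = Counter()
    for token in tweet.lower().split():
        polarity = PRIOR_POLARITY.get(token)
        if polarity is not None:
            feats[f"{polarity}_{POS_TAGS.get(token, 'UNK')}"] += 1
    return dict(feats)

print(polarity_pos_features("I love this movie"))        # {'positive_VB': 1}
print(polarity_pos_features("terrible terrible movie"))  # {'negative_JJ': 2}
```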
Turney et al. used a bag-of-words approach for sentiment analysis, in which the relationships between words are not considered and a document is represented as a simple collection of words. To determine the sentiment of the whole document, the sentiment of each word is determined and these values are combined using aggregation functions.
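A minimal sketch of this aggregation idea, assuming an invented toy lexicon of per-word scores and simple summation as the aggregation function:

```python
# Minimal sketch of the bag-of-words idea: score each word from a (toy,
# invented) sentiment lexicon and aggregate the scores over the document.
WORD_SCORES = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}

def document_sentiment(text, aggregate=sum):
    """Aggregate per-word sentiment scores; word order is ignored."""
    scores = [WORD_SCORES.get(w, 0.0) for w in text.lower().split()]
    total = aggregate(scores)
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

print(document_sentiment("a good plot but awful acting"))  # 1.0 - 2.0 -> negative
```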
Kamps et al. used the WordNet lexical database to determine the emotional content of a word along its different dimensions. They defined a distance metric on WordNet and determined the semantic polarity of adjectives.
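A sketch in the spirit of this approach is given below: an adjective is scored by comparing its graph distance to "good" versus "bad" over WordNet synonym links. It assumes the NLTK WordNet data is available, and the scoring formula is illustrative rather than the exact measure used in that work.

```python
# Illustrative WordNet-based polarity measure: distance to "good" vs "bad"
# over adjective synonym links. Requires nltk.download("wordnet").
from collections import deque
from nltk.corpus import wordnet as wn

def adjective_synonyms(word):
    """All lemma names sharing an adjective synset with the given word."""
    neighbours = set()
    for synset in wn.synsets(word):
        if synset.pos() in ("a", "s"):                 # adjectives and satellites
            neighbours.update(name.lower() for name in synset.lemma_names())
    neighbours.discard(word)
    return neighbours

def synonym_distance(source, target, max_depth=8):
    """Breadth-first search over the adjective synonymy graph."""
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        word, depth = queue.popleft()
        if word == target:
            return depth
        if depth < max_depth:
            for neighbour in adjective_synonyms(word) - seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None

def polarity(adjective):
    d_good = synonym_distance(adjective, "good")
    d_bad = synonym_distance(adjective, "bad")
    if d_good is None or d_bad is None:
        return 0.0
    return d_bad - d_good   # > 0 leans positive, < 0 leans negative

print(polarity("nice"), polarity("horrible"))
```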
Luo et al. highlighted the challenges of, and efficient techniques for, mining opinions from Twitter tweets. Spam and highly varied language make opinion retrieval on Twitter a challenging task.