Hybrid model for sentiment analysis combination of PSO, genetic algorithm and voting classification

ABSTRACT


INTRODUCTION
Sentiment analysis, also known as sentiment classification, polarity classification, affect analysis, opinion mining, or subjectivity analysis, is an interdisciplinary branch of research that studies people's opinions about various products. It combines elements of knowledge discovery, computational linguistics, natural language processing (NLP), and other fields. Unlike facts, opinions and sentiments are subjective, and they can be extracted from text using a range of computational and NLP-based approaches. As social network services such as Weibo and Twitter have grown in popularity, sentiment analysis has attracted NLP researchers in the last few years [1], [2]. On social networks, user opinions are often expressed in brief texts that convey a user's opinion of a certain object [3], [4]. User sentiment refers to a user's perception of, or attitude towards, a circumstance or event (called a topic). Sentiment analysis of social media texts faces new difficulties compared with sentiment analysis of traditional web texts, because popular subjects on social media shift continuously. Sentiment, topics, and domains are well known to be closely related [5]. Manually collecting labelled data from the vast array of subjects covered by large-scale social media in order to train sentiment classifiers across several domains would be extremely difficult. Through text data, sentiment analysis can effectively reveal people's opinions (whether favorable, unfavorable, or neutral) about a good or service [6], [7]. The subtasks of sentiment analysis shown in Figure 1 are carried out by examining the user content shared on social media.
Six components make up the system: a data collector, a noise filter, a sentiment and emotion analysis engine, a predictive analyser, a results viewer, and a database. The "Data Collector" crawls unprocessed data from a variety of websites, including blogs, Twitter, and discussion forums. The "Noise Filter" module receives data from this module, which is a collection of scripts, and processes it; how the data are read depends on whether the data sources provide a programmatic interface [8], [9]. "Meaningless data", such as advertisements, empty posts with no comments, and noise associated with some content, are removed by the "Noise Filter/Smart Filter" [10].
Predictive analysis of events such as revenue and reputation crises is carried out by the "Predictive Analyser" so that the results can be used for crucial business tasks such as forecasting, monitoring, and action strategizing. Its two main parts are the pool of prediction algorithms and the predictor/feature set. The result of sentiment and emotion analysis, comprising emotions such as anger, grief, and anxiety as well as positive, negative, neutral, and mixed sentiments, serves as a new predictor in addition to the existing predictors and features. Integrating the sentiment results and combining the insights of the prediction algorithm pool with the emotion analysis engine makes it possible to perform time-series analysis, anomaly identification, and customer preference analysis for sales forecasting [11], [12].
This article addresses sentiment analysis of Twitter data. Pre-processing, feature extraction, and classification are the three stages of the sentiment analysis process. The key step is feature extraction, in which the attribute set is related to the target set [12]-[14]. Previous research shows that authors do not concentrate on the feature extraction phase, which leads to low accuracy. This work proposes a hybrid optimization model for feature extraction to increase the accuracy of sentiment analysis.
Figure 1. Social media sentiment analysis system [13]
Section 1 introduces sentiment analysis and the basic steps of a social media sentiment analysis system [15]. Various models for sentiment analysis have been proposed by authors in previous years. Section 2 presents the literature survey on sentiment analysis. Section 3 presents the proposed model with pseudocode and flowcharts. Section 4 presents the results and discussion, in which detailed results of the proposed model are reported and compared with existing models. The last section concludes the article.

LITERATURE SURVEY
Hybrid algorithms can achieve faster convergence and greater improvement. They can also be more effective than pure optimization algorithms in solving engineering problems. However, hybrid algorithms can make control parameters harder to set and may require problem-specific design. This section presents a brief study on enhancing the performance of opinion mining by employing diversified hybrid techniques.
Long et al. [15] developed a convolutional bi-directional recurrent neural network based on Bidirectional Encoder Representations from Transformers (BERT) to exploit syntactic and semantic information for analyzing data sentimentally and contextually. First, polarity scores were computed and labels were assigned on the basis of zero-shot classification. After that, sentence-level semantics and contextual attributes were acquired, and embeddings were generated using a pre-trained BERT model. A bi-directional long short-term memory (Bi-LSTM) network then modelled the sentences as sequences.
Mahadevaswamy and Swathi [16] described sentiment analysis as a cognitive tool for extracting the emotional tone of a text. They designed a Bi-LSTM algorithm that handles long-term dependencies by using memory within the model to make better predictions. The Amazon Product dataset, containing 104,975 reviews, was used in the experiments. The algorithm focused on classifying the reviews into positive and negative classes. The results showed that the designed algorithm yielded an accuracy of around 91.4% in comparison with other methods.
Li et al. [17] suggested the Humor-EMOji-Slang-based (HEMOS) system, which uses a deep learning technique to classify sentiment in Chinese text. The importance of recognizing humor, pictograms, and slang in social media processing tasks was assessed. First, 576 frequent Internet slang expressions were gathered as a slang lexicon. Weibo emojis were then transformed into textual attributes to produce a Chinese emoji lexicon. In the last stage, both lexicons were applied to an attention-based bi-directional long short-term memory recurrent neural network (AttBiLSTM), and its efficacy was evaluated.
Pimpalkar and Raj [18] established a model based on deep learning and GloVe word embeddings to extract the relevant semantics of words in texts. A CNN layer was employed to learn the attributes, and a multi-layered bi-directional long short-term memory (MBiLSTM) network used these features to capture long-range context. An experiment was conducted to classify feelings and user reviews precisely as positive or negative. The model yielded an accuracy of 92.05% in testing and 93.55% in validation. The superiority of the model was demonstrated on IMDB datasets.
Wang et al. [19] introduced an approach for sentiment analysis. First, the optimal sentiment concept of each word was obtained from the Microsoft Concept Graph (MCG) according to the word's context. Second, the sentiment information associated with the words was extracted from a multi-semantics sentiment intensity lexicon. Integrating this sentiment information made it possible to represent word semantics and sentiment correctly. Finally, a more comprehensive word representation was achieved by combining two improved word embedding (WE) approaches. Results on six datasets documented the efficacy of the proposed strategies.
Tan et al. [20] developed a hybrid deep learning technique called RoBERTa-LSTM, which combines the robustly optimized BERT approach (RoBERTa) with long short-term memory (LSTM) to analyse sentiment. The former maps the words into a compact WE space, and the latter captures long-distance contextual semantics effectively. The experimental outcomes showed that the technique yielded an F1-score of 93% on the IMDb dataset, 91% on the Twitter US Airline dataset, and 90% on the Sentiment140 dataset.
He et al. [21] proposed a fusion sentiment analysis (FSA) technique that integrates textual analysis methods with machine learning (ML) algorithms for mining the online product experience. The technique was executed in three stages. In the first stage, a sentiment dictionary was used to extract the sentiment attributes. After that, a support vector machine (SVM) algorithm was used to recognize the sentiment polarities of reviews. Latent Dirichlet allocation (LDA) was then used to extract the sentiment topics from the reviews, and a weighting technique was employed to compute the sentiment contribution. Finally, the reading experiences of customers for online books available on Amazon were used to evaluate the technique. The experimental outcomes showed that the technique is effective in analysing the sentiments of reviews and in capturing the elements of reviews that affect the reading experience.
Ahmad et al. [22] introduced an attention-based multi-channel gated recurrent neural network (Att-MC-GRU) for extracting aspects and analysing textual reviews while classifying their sentiments. A hybrid method combining WE, part-of-speech (POS) tags, and contextual position information was put forward. This method improved the accuracy of recognizing and predicting the aspects and their associated sentiments.
The authors of [23] suggested a modified Bayesian boosting approach with weight-guided optimal feature selection for sentiment analysis. It includes a feature selection scheme that weights individual features according to how crucial they are for sentiment categorization. This weight-guided feature selection improved classification performance by identifying the most pertinent and discriminative characteristics. To maximize the benefit of boosting, the approach used a modified Bayesian boosting technique.
Integrated tools and methodologies for sentiment analysis that let users experiment with and evaluate various algorithms and optimisations based on personalised preferences and parameters are lacking. This work suggests a hybrid method for sentiment analysis using a genetic algorithm (GA) and particle swarm optimization (PSO). The method helps to select optimal features and to encode a workable solution to a problem as an individual. Each individual is viewed as an entity that carries chromosomal traits. A population is made up of several individuals [24], [25].
GA serves as a search approach based on natural selection and survival of the fittest; it optimises the selection of relevant word sequences from a random corpus. GA operates on a fixed-size population of individuals that evolves over time, and it processes the dataset without bias. PSO is a fast population-based method with rapid convergence; depending on their present position in the particle space and the associated velocity vector, the variables in PSO can take any value. As noted above, combining different types of algorithms can give faster convergence and greater improvement than a pure optimization algorithm, at the cost of harder parameter setting and problem-specific design. Our goal is to integrate PSO's exploration capabilities with GA's exploitation capabilities. The proposed method first runs PSO to identify the best-suited particles, i.e. sequences of words, in the search space; GA then optimizes them to estimate the most effective sequence. To extract sufficient information, the proposed algorithm classifies the combined result by voting classification, a combination of random forest, SVM, and k-nearest neighbours (KNN). SVM provides a comprehensive solution to the bias-variance trade-off problem. The identifiers obtained are classified by the voting classifier to provide numbers that indicate the polarity values of certain words in the lexicon. The classifier predicts the class with the highest likelihood.

METHOD
Data on social media are growing exponentially, and diverse automated analysis techniques have been proposed, assisted by fuzzy systems, AI techniques, opinion mining techniques, heuristic techniques, and so on. This work presents an algorithm that combines PSO and GA for feature extraction, with classification done by a voting method. The work analyzes sentiment in Twitter data. The new mechanism has several stages, which are defined as follows.

Extraction of microblogs data and its pre-processing
Many users provide information in diverse forms, such as tweets, to express their sentiments on various topics. The Twitter data sample is divided into two categories: negative and positive. This allows the sentiment of the data to be examined easily in order to analyze the outcomes for several properties. Unprocessed data, however, are highly susceptible to inconsistency and redundancy.
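As an illustration of how inconsistency and redundancy in raw tweets might be reduced, the following minimal Python sketch normalizes tweet text and drops duplicates. The specific cleaning rules (lowercasing, removing URLs, mentions, and punctuation) and the function names `clean_tweet` and `deduplicate` are assumptions made for this example; the source does not specify the pre-processing steps.

```python
import re


def clean_tweet(text: str) -> list[str]:
    """Lowercase a tweet, strip URLs, mentions and punctuation, and tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits and punctuation
    return text.split()


def deduplicate(tweets: list[str]) -> list[str]:
    """Keep the first tweet for each distinct cleaned token sequence."""
    seen, unique = set(), []
    for tweet in tweets:
        key = tuple(clean_tweet(tweet))
        if key and key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique
```

For example, `deduplicate(["Good day!", "good day", "Bad day"])` keeps only the first and third tweets, because the first two reduce to the same tokens.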

Feature extraction
Feature extraction is the crucial step in establishing the link between the attribute set and the target set. The hybrid optimisation algorithm combines PSO with GA, and the hybrid model is represented by the flowchart. This model may be used to select optimal attributes and to encode a workable solution to a problem as an individual. Each individual is viewed as an entity that carries chromosomal traits. Together, a number of individuals form a population. Before GA is applied, a population of chromosomes is created at random, with the chromosomes encoding the problem variables. The generated chromosomes are then analysed. Chromosomes that represent good solutions are important for creating further chromosomes, since they indicate the best way to approach the problem. The population in this algorithm refers to the initial collection of random solutions. Each individual of the population is represented by a chromosome that encodes a candidate solution to the problem. The formula for decoding is (1).

𝑋
The chromosomes are created over generations, i.e. successive iterations. In each generation, a variety of fitness measures are used to assess the fitness value of the chromosomes. Each particle $i$ is associated with a velocity vector $V_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,D}]$ and a position vector $X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,D}]$. To carry out the search, the positions of new solutions, $X_i$, are changed at a constant rate. The method records the best historical position of each individual as $P_i$, and $G$ denotes the current global best position that the particle swarm has found. Once these positions are found, all particle velocities and positions are updated according to (2) and (3).

$$v_{i,d}(t+1) = \omega\, v_{i,d}(t) + c_1 r_1 \left(p_{i,d} - x_{i,d}(t)\right) + c_2 r_2 \left(g_d - x_{i,d}(t)\right) \tag{2}$$

$$x_{i,d}(t+1) = x_{i,d}(t) + v_{i,d}(t+1) \tag{3}$$

Here $t$ denotes the $t$-th iteration, $d$ the $d$-th dimension of the particle, $\omega$ the inertia weight, $c_1$ and $c_2$ the acceleration constants, and $r_1$ and $r_2$ random numbers in the interval [0, 1]. The efficiency of the algorithm is improved by reducing the inertia weight $\omega$, which is defined in (4).

$$\omega = \omega_{\max} - \frac{(\omega_{\max} - \omega_{\min})\, t}{t_{\max}} \tag{4}$$
Here $\omega_{\max}$ denotes the maximum weight, $\omega_{\min}$ the minimum weight, $t$ the number of the current iteration, and $t_{\max}$ the maximum number of iterations. Two chromosomes from the present generation are combined using a crossover operator, or modified using a mutation operator (MO), to create offspring. Each new generation keeps the population size constant. Based on fitness values, certain parents and offspring are chosen to form this generation, while the others are discarded. Other options for selecting the optimum chromosomes also exist. The proposed model for feature extraction is illustrated in Figure 2.
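The velocity and position updates with a decreasing inertia weight can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the linear decay schedule, the default values `w_max=0.9`, `w_min=0.4`, `c1=c2=2.0`, and the function names are assumptions chosen for the example.

```python
import random


def inertia_weight(t, t_max, w_max=0.9, w_min=0.4):
    """Inertia weight that decreases linearly from w_max to w_min (assumed schedule)."""
    return w_max - (w_max - w_min) * t / t_max


def pso_step(x, v, p_best, g_best, t, t_max, c1=2.0, c2=2.0, rng=random):
    """Apply one velocity and position update to a single particle.

    x, v      -- current position and velocity (lists of floats)
    p_best    -- best position this particle has visited
    g_best    -- best position found by the whole swarm
    """
    w = inertia_weight(t, t_max)
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        vd = (w * v[d]
              + c1 * r1 * (p_best[d] - x[d])   # pull towards the personal best
              + c2 * r2 * (g_best[d] - x[d]))  # pull towards the global best
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v
```

A particle that already sits on both its personal best and the global best with zero velocity does not move, which is a quick sanity check on the update.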

Feature selection algorithm
This work suggests a hybrid method for sentiment analysis using GA and PSO algorithms. The method helps to select optimal features and to encode a workable solution to a problem as an individual. Each individual is viewed as an entity that carries chromosomal traits. The GA used in the model is given as Algorithm 1.
Algorithm 1. GA design for the proposed model
The proposed model uses GA for feature selection; GA serves as a search approach based on natural selection and survival of the fittest. The output of Algorithm 1 is processed by PSO, which converges quickly, so the model achieves faster convergence. The process is given as Algorithm 2.
Algorithm 2. Designed algorithm
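The crossover and mutation steps of a GA for feature selection can be sketched in Python as follows. This is a hedged illustration, not the authors' Algorithm 1: the binary chromosome encoding (1 keeps a feature, 0 drops it), single-point crossover, bit-flip mutation, selection of the fitter half as parents, the default probabilities `p_c=0.8` and `p_m=0.05`, and all function names are assumptions. Chromosomes are assumed to have at least two genes and the population at least two individuals.

```python
import random


def crossover(a, b, rng):
    """Single-point crossover of two equal-length chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chrom, p_m, rng):
    """Flip each gene independently with probability p_m."""
    return [1 - g if rng.random() < p_m else g for g in chrom]


def next_generation(population, fitness, p_c=0.8, p_m=0.05, rng=None):
    """Produce a new population of the same size from the current one."""
    rng = rng or random.Random()
    # Keep the fitter half as parents (truncation selection, an assumption).
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]
    children = []
    while len(children) < len(population):
        a, b = rng.sample(parents, 2)
        if rng.random() < p_c:          # crossover with probability p_c
            a, b = crossover(a, b, rng)
        children.extend([mutate(a, p_m, rng), mutate(b, p_m, rng)])
    return children[: len(population)]  # population size stays constant
```

In practice the `fitness` argument would score a feature mask by the performance of a classifier trained on the selected features; a toy function such as `sum` is enough to exercise the code.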

Classification
Several classification techniques, including random forest (RF), SVM, and KNN, are used in this study for sentiment analysis. The voting classifier is a combination of RF, SVM, and KNN. SVM applies the idea of statistical learning. This method finds a solution to the primary problem without addressing a more complicated problem as an intermediate step. SVM provides a comprehensive solution to the bias-variance trade-off problem. There are two ways to put this algorithm into practice: the first involves mathematical programming, while the second uses kernel functions. The kernel-based approach splits the data into classes P and N: class P applies when y_i = +1, and class N applies when y_i = -1.
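For reference, the two-class labelling corresponds to the standard linear, hard-margin SVM formulation shown below. The symbols $w$, $b$, and $x_i$ are the usual textbook notation and do not appear in the source; the linear case is shown as an assumption for simplicity, and a kernel-based SVM replaces the inner product with a kernel function.

```latex
% y_i \in \{+1, -1\} marks training sample x_i as class P or class N.
% Decision rule for a new sample x:
f(x) = \operatorname{sign}\left(w^{\top} x + b\right)
% Maximum-margin training problem (hard-margin, linear case):
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left(w^{\top} x_i + b\right) \ge 1 \quad \forall i
```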
SVM looks for the best separating surface, or hyperplane, at equal distance from each class. Random forest (RF) is made up of a combination of tree predictors. It almost always produces acceptable results, and its performance is hard to improve upon. It handles a variety of data types, including nominal, binary, and numerical data. RF builds several trees and merges them to provide precise and relevant results, and it handles both regression and classification. The KNN algorithm is a supervised machine learning technique used to solve prediction problems in regression and classification.

Voting classifier algorithm
The proposed technique uses a hybrid optimization algorithm combining PSO and GA for feature extraction and employs voting classification, a combination of random forest, SVM, and KNN, for labelling. SVM provides a comprehensive solution to the bias-variance trade-off problem. The voting classification procedure is given as Algorithm 3.
Algorithm 3. Voting classification
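The way three base classifiers can be combined by voting is sketched below in Python. This is an illustration under stated assumptions, not the authors' Algorithm 3: it uses hard (majority) voting over predicted labels, whereas the source does not say whether hard or probability-based voting is used, and the function name `hard_vote` is hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label for each sample.

    `predictions` holds one list of predicted labels per base classifier
    (for example random forest, SVM and KNN), all of the same length.
    """
    voted = []
    for sample_preds in zip(*predictions):
        # Pick the label that most base classifiers predicted for this sample.
        label, _count = Counter(sample_preds).most_common(1)[0]
        voted.append(label)
    return voted
```

With three classifiers and two classes there is always a strict majority, so no tie-breaking rule is needed in the binary positive/negative setting.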

RESULTS AND DISCUSSION
A combination of a hybrid optimization algorithm (PSO and GA) for feature extraction and voting classification is proposed. The suggested model is evaluated on three datasets and compared with four ensemble models. Dataset 1 contains 1964 instances (Table 2), dataset 2 contains 6439 instances (Table 3), and dataset 3 contains 1960 instances (Table 4). These datasets were gathered from Twitter via the Tweepy API [33]. Three metrics are used to evaluate the performance of each ensemble and of the proposed classifier: accuracy, precision, and recall. As illustrated in Figures 3-5, the outcomes of the proposed model are contrasted with those of four existing ensemble models on the three datasets. The analysis demonstrates that the proposed model achieves higher accuracy, precision, and recall than the other ensemble models.
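The three evaluation metrics can be computed from true and predicted labels as in the following Python sketch. It assumes binary labels with the positive class encoded as 1; the function name `evaluate` and the convention of returning 0.0 when a denominator is zero are choices made for this example.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that are found
    return accuracy, precision, recall
```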
Hybrid algorithms can achieve faster convergence. This work suggests a hybrid model for sentiment analysis that combines GA and PSO. GA is employed to identify the best-suited prominent sequences of words from a random space using the concept of survival of the fittest, PSO achieves fast convergence, and SVM is included in the voting classifier for effective handling of the bias-variance trade-off while classifying sequences. The proposed technique classifies the result by voting classification, a combination of random forest, SVM, and KNN, and achieves consistent accuracy across diverse datasets. The proposed model is compared with four ensemble models. It attains an accuracy of 94.45% on dataset 1, 95.56% on dataset 2, and 96.89% on dataset 3. The proposed technique thus achieves a consistent accuracy of more than 90% on three different datasets, owing to the natural selection of sequences by GA, while converging quickly with PSO. The model may be employed in recommender systems that demand high precision and accuracy.

CONCLUSION
Sentiment analysis technologies make it possible to analyse social media data automatically to determine the polarity of posted opinions. In recent years, these systems have been extended to examine additional factors, such as a user's attitude towards a subject or their emotions, and have even combined text analytics with other inputs, such as multimedia analysis or analysis of social media. Existing models are observed to fall short of high accuracy because they do not extract features properly. In this work, a hybrid algorithm combining PSO and GA is proposed for feature extraction, and a voting method is then applied for classification. The suggested model is evaluated on three datasets with varying numbers of instances and compared with four existing ensemble models to test its reliability. The results show that the suggested model achieves the highest recall, precision, and accuracy. The four ensemble models used for comparison, ensembles 1, 2, 3, and 4, are themselves hybrid models for sentiment analysis. The accuracy of the ensemble models is approximately 75% to 85% on all three datasets, whereas the accuracy of the proposed model is between 90% and 95%. These results show that the proposed model is approximately 8% to 10% more accurate than the ensemble models for sentiment analysis.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752  Hybrid model for sentiment analysis combination of PSO, genetic algorithm and … (Garima Srivastava) 1155

Table 1. Summary of evolution of hybrid algorithms
Table 1 summarizes the evolution of hybrid algorithms with diversified combinations for enhanced opinion mining performance. A number of techniques with diversified combinations have been studied; however, the following gaps still exist:
a) An optimum technique for identifying the best-suited prominent sequence of words from a random space using the concept of survival of the fittest for opinion mining
b) Effective handling of the bias-variance trade-off while classifying sequences
c) Consistent accuracy in predictions over diversified databases

1: Initialize: population size P, momentum constants c_1 and c_2, crossover likelihood, mutation likelihood, number of partitions, total variables in every section, total solutions in every section, and maximum number of iterations.
6: Identify an intermediate population from the previous generation.
7: Assign a random number r in the range (0, 1) to each row of the intermediate population.
8: If r is less than the crossover likelihood, then
9: Apply the crossover operator to all identified pairs in the intermediate population.
10: Establish the updated population.
11: End if
12: Assign a random number r1 in the range (0, 1) to every gene of every individual in the population.
13: If r1 is less than the mutation likelihood, then
14: Mutate the gene by generating a new value chosen at random within the gene's own domain (mutation operator).
15: Establish the updated population.
16: End if
17: Evaluate the fitness function for all individuals in the population.
18: Terminate only if the stopping criterion is met.

Table 2. Results on dataset 1

Table 3. Results on dataset 2

Table 4. Results on dataset 3