Sarcasm Detection Using Multi-Channel Attention Based BLSTM On News Headline

Sarcasm is often used in social media to express a negative opinion through positive or intensified positive words. This intentional ambiguity makes sarcasm detection an important task in sentiment analysis, since an undetected sarcastic tone hinders the performance of sentiment analysis systems. The majority of studies on automatic sarcasm detection emphasize lexical, syntactic, or pragmatic features that are often expressed through figurative devices such as particular words, emoticons, and exclamation marks. In this paper, we introduce a multi-channel attention-based bidirectional long short-term memory (MCAB-BLSTM) network to detect sarcastic news headlines. The proposed MCAB-BLSTM model was evaluated on a news headline dataset, and its results compared favorably to those of CNN-LSTM and a Hybrid Neural Network.


Introduction
Text has become a primary way to express ideas and convey information from one individual to another because of its precision and completeness of expression. Text classification has been widely applied in areas including news classification, sentiment analysis, and automatic question-answering systems [1].
However, text classification is a broad field and is not limited to the positive/negative classification that has dominated recent years, because one type of text carries the opposite of its literal meaning: sarcastic text, or sarcasm. "Sarcasm is a specific type of sentiment where a person expresses his negative feelings using positive words in his speech or exaggerates the positive words" [2]. To classify such texts, researchers therefore need a deeper understanding of language, dialogue, and the skill to interpret context along with content. The problem is that, in the presence of sarcasm, the polarity of the user's expression is reversed, leading to misclassification errors that degrade the performance of the system [3]. This is a difficult task because of the complexity and ambiguity of natural language, where a word may have different meanings, or several phrases can be used to express the same idea [4]. Consequently, classifying sarcastic text remains difficult for many natural language processing systems; even humans find it hard to reliably verify the occurrence of sarcasm in a statement. As described by Pozzi and colleagues [5], "The difficulty in recognition of sarcasm causes misunderstanding in everyday communication and poses problems to many NLP systems." Many studies have tried to identify sarcasm and have proposed various models; some found that sarcastic sentences carry certain biases within the sentence itself [6], while others claim that the phenomenon is non-literal and its identification requires pragmatic knowledge [7]. Based on existing research, two methods are frequently used in Natural Language Processing (NLP): the Convolutional Neural Network (CNN) and the Long Short-Term Memory network (LSTM) [8].
In this paper, we present a multi-channel attention-based BLSTM sarcasm detection technique for news headlines that combines the advantages of a multi-channel mechanism, an attention mechanism, and a bidirectional long short-term memory (BLSTM) layer. Moreover, we compared this approach with other deep learning algorithms to demonstrate the performance and accuracy our model achieves. We applied our approach to detect sarcasm in the news headline dataset.
In the rest of this paper, related works are summarized in the "Related works" section. In the "Methods/experimental" section, we present the methodology we implemented to recognize sarcasm in news headlines, along with the dataset used. In the "Results and discussion" section, the results are discussed. At the end, insights for future work and a short summary are presented.

Related Works
A great deal of research has been done lately in the domain of natural language processing. In [9], Jain, Kumar, and Garg proposed an RNN approach with a feature-rich CNN to improve the accuracy of sarcasm detection, using a dataset of 3,000 sarcastic tweets and 3,000 non-sarcastic tweets in Hindi and English. The model begins with a preprocessing stage in which the data are normalized by removing punctuation marks, special characters, and regular expressions, and stemmed to return each word to its base form. The data are then tokenized using GloVe for English and Hindi-SentiWordNet for Hindi. The English vectors are processed with the BLSTM method, and the results pass through a soft attention layer that measures the semantic closeness of each word to the intent of the sentence, followed by a stage that captures pragmatic conceptual relationships. The English and Hindi vectors then enter a convolution layer to produce a feature map, a ReLU layer to handle non-linearity, and a pooling layer that reduces the dimensions of the feature map and retains the most influential features. Finally, the representation layer feeds a fully connected softmax layer that calculates the probability of each word and classifies the tweets as sarcastic or non-sarcastic.
In [10], Mandal and Mahto proposed a CNN-LSTM model with word embeddings to improve the accuracy of sarcasm detection, using a manually categorized headline dataset of 26,709 records divided into 11,725 sarcastic and 14,984 non-sarcastic headlines.
Their model consists of a preprocessing stage in which the data are normalized by removing punctuation marks, special characters, and regular expressions, then stemmed to return each word to its base form. The data are tokenized using a word-embedding dictionary built from the 10,000 most frequent words in the data. In the convolution stage, the vectors are processed by a 1-D convolution layer with 32 filters and a kernel size of 7, so each filter covers seven-word combinations. The results go to a 1-D max pooling layer, where each kernel's output is converted into a single value based on the highest activation, and are then processed again in a second 1-D convolution layer. The data are subsequently processed by the CNN-LSTM method with a dropout of 0.5, and in the final stage the results are trained with a binary cross-entropy loss function. Evaluated on accuracy, the model achieved 86.16%.
Mehndiratta and Soni [11] proposed RNN and CNN models, and a combination of the two, with word embeddings and hyperparameter tuning to increase the accuracy of sarcasm detection, using a Reddit dataset of 1.35 million manually categorized sarcastic comments drawn from a total of 533 million comments. Their model consists of a preprocessing stage in which the data are normalized by eliminating punctuation marks, special characters, and regular expressions, then stemmed to return each word to its base form. The data are tokenized using the GloVe or fastText word-embedding dictionaries and fed into CNN, LSTM, CNN-LSTM, and LSTM-CNN classifiers, with hyperparameters in the form of epoch counts and dropout; a report comparing the accuracy of each method is then generated.
In [12], Kumar, Sangwan, Arora, Nayyar, and Abdel-Basset proposed a deep learning model called sAtt-BLSTM convNet, which combines soft attention-based bidirectional long short-term memory (sAtt-BLSTM) and a convolutional neural network (convNet) to improve sarcasm detection accuracy, using 40,000 random tweets labeled as 15,000 sarcastic and 25,000 non-sarcastic. The model consists of a preprocessing stage in which the data are normalized by removing punctuation marks, special characters, and regular expressions, then stemmed to return each word to its base form. The data are tokenized using the GloVe word-embedding dictionary and processed by the BLSTM, whose soft attention combines the backward and forward outputs into one representation. The results pass through an attention layer that measures the semantic closeness of each word to the intent of the sentence, then through a convolution layer that produces a feature map, a ReLU layer that handles non-linearity, and a pooling layer that reduces the dimensions of the feature map and retains the most influential features. Finally, the representation layer feeds a fully connected softmax layer that calculates the probability of each word and classifies the tweets as sarcastic or non-sarcastic.
In [13], Hiai and Shimada proposed an RNN approach with a relationship vector to improve the accuracy of sarcasm detection, using a dataset of 21,000 sarcastic and 21,000 non-sarcastic tweets. The model consists of a preprocessing stage in which the data are normalized by removing punctuation marks, special characters, and regular expressions, then stemmed to return each word to its base form. The data are tokenized using word2vec, and the results are processed with the role-pair relation vector method, which captures the relationships between feature vectors based on the training data. The results are then processed by a BLSTM-type RNN with a vector dimension of 200, 150 hidden units, and an epoch count of 30.
In [14], Xiong, Zhang, Zhu, and Yang proposed a self-matching network and BLSTM approach with low-rank bilinear pooling to improve the accuracy of sarcasm detection, using a dataset of 91,717 records. The model consists of a preprocessing stage in which the data are normalized by removing punctuation marks, special characters, and regular expressions, then stemmed to return each word to its base form. The data are tokenized using GloVe and processed by the self-matching network and BLSTM to produce two feature maps, which then enter the low-rank bilinear pooling process that combines them and calculates the sarcastic and non-sarcastic probabilities of the input.
In [15], Misra and Arora proposed a Hybrid Neural Network approach to improve the accuracy of sarcasm detection, using a news headline dataset of 29,709 records. The model consists of a preprocessing stage in which the data are normalized by removing punctuation marks, special characters, and regular expressions, then stemmed to return each word to its base form. The data are tokenized using GloVe and processed by the Hybrid Neural Network to produce two feature maps, which are combined; the sarcastic and non-sarcastic probabilities are then calculated using softmax.
In [16], Kumar, Narapareddy, Srikanth, Malapati, and Neti proposed a BLSTM-based multi-head attention approach to improve the accuracy of sarcasm detection, using a dataset of 110,914 sarcastic and 173,003 non-sarcastic Reddit comments. The model consists of a preprocessing stage in which the data are normalized by removing punctuation marks, special characters, and regular expressions, then stemmed to return each word to its base form. The data are tokenized using GloVe and processed by a BLSTM with 100 vector dimensions, 100 hidden units, and a dropout of 0.5. The results pass through a sentence-level multi-head attention layer that measures the importance of each word based on semantic factors, and then through an auxiliary features concatenation step, in which semantic, sentiment, and punctuation features extracted from the raw data are combined with the output of the previous step to create a new representation. The result is fed into a softmax layer that calculates the probability of each word and classifies the data as sarcastic or non-sarcastic.
Among the contributions of this work is an end-to-end network that comprises the proposed model steps: preprocessing, token vectorization, BLSTM, attention layer, pooling, ReLU layer, and representation layer. We compare the results with two other deep learning approaches.

Methods/experimental
In this section, we present the dataset used and our methodology for recognizing sarcasm in news headlines using the Multi-Channel Attention Based BLSTM, in addition to two other deep learning algorithms: CNN-LSTM and the Hybrid Neural Network.

Used dataset
We used the News Headline dataset provided by [15], which classifies news headlines as sarcastic or non-sarcastic. The dataset consists of 56,418 news headlines, of which 25,846 are sarcastic and 30,752 are non-sarcastic. We split the dataset into 45,352 headlines for the training set and 11,066 headlines for the testing set, i.e., roughly 80% for training and 20% for testing. The training set was used to train the classifier and to optimize its parameters, while the test set (unseen by the model) was reserved to evaluate the built model and provide an indication of how well it generalizes. We also tried a 70%/30% split, which gave us the same result of 97.84%.
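The 80/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' code; the paper does not specify the shuffling procedure, so a fixed random seed is assumed here for reproducibility, and the exact published split sizes (45,352 / 11,066) differ slightly from a strict 80% cut.

```python
import random

def train_test_split(examples, train_frac=0.8, seed=42):
    """Shuffle and split examples into train/test portions.

    A hypothetical helper illustrating the 80/20 split; the seed
    and shuffling strategy are assumptions, not from the paper.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 56,418 headlines -> a strict 80% cut gives 45,134 train / 11,284 test
headlines = list(range(56418))
train, test = train_test_split(headlines)
```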

Data preprocessing
Before data are transferred to the input layer, they are pre-processed to clean and transform them for feature extraction [17]. The steps we followed are: 1. Case folding: converting the entire text of a document into a standard form (in this case, lowercase). 2. Tokenization: cutting a document into parts called tokens and removing characters that are considered punctuation. 3. Stop-word removal: removing words that contribute little to the content of the text, such as "and", "i", "you", "with", "she", "he", and others. 4. Stemming: returning each affixed word to its root form.
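The four preprocessing steps above can be sketched as a small pipeline. This is an illustrative assumption, not the authors' implementation: the stop-word list is a tiny sample, the regex tokenizer is a stand-in, and the suffix stripper only approximates a real stemmer (the paper does not name the one used).

```python
import re

# Small sample stop-word list (the full list used in the paper is not given).
STOP_WORDS = {"and", "i", "you", "with", "she", "he", "the", "a"}

def stem(token):
    # Naive suffix stripping, illustrative only (not Porter stemming).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(headline):
    """Apply the four steps above: case folding, tokenization with
    punctuation removal, stop-word removal, and stemming."""
    text = headline.lower()                              # 1. case folding
    tokens = re.findall(r"[a-z']+", text)                # 2. tokenize, drop punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop-word removal
    return [stem(t) for t in tokens]                     # 4. stemming
```

For example, `preprocess("Scientists Baffled: Cats And Dogs Sleeping Together")` drops "and" and strips suffixes from the remaining tokens.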

Proposed Multi-Channel Attention Based BLSTM
The proposed deep learning model uses the following layers: an input layer, embedding layer, BLSTM layer, attention layer, max pooling layer, ReLU layer, concatenation layer, and representation layer.

Input Layer
The news headlines, after pre-processing, are fed to the input layer. The input layer is connected to the embedding layer, which builds word embeddings using GloVe and fastText.

Embedding Layer
The embedding layer maps the input into real-valued vectors using encodings from look-up tables; word embeddings provide learned word representations. The benefits of word-embedding-based features for detecting sarcasm have recently been reported [18]. In this study, GloVe and fastText, each of which generates a word-vector table, are used to build the word embeddings. Both are count-based models that represent words by feature vectors; such log-bilinear models learn the relationships between words by counting how often they co-occur. The model thus maps every tokenized word in each news headline to its entry in the word-vector table, and proper padding is performed to unify the feature-vector matrix. That is, if the total number of headlines is Z and a headline X has t tokens, a word vector V of dimension d is generated for each token using GloVe and fastText. For all Z headlines, each token in X is mapped to its respective vector V, and each X is then expressed as the concatenation of its word embeddings, giving the feature-vector matrix shown in (1),
where C is the concatenation operator of the vectors. The headlines are of varying length, so to unify the feature-vector matrix representation, the maximum headline length in the given corpus is used as a threshold value; this fixes the length of the headline matrix. For all headlines shorter than this threshold, zero padding was performed. This matrix was finally fed as input (i.e., F) to the BLSTM layer.
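The embedding lookup and zero padding described above can be sketched as follows. This is a minimal illustration assuming a generic token-to-vector lookup table in place of the real GloVe/fastText tables; out-of-vocabulary tokens are mapped to zero vectors, an assumption the paper does not specify.

```python
import numpy as np

def build_feature_matrix(headlines, embed, dim, max_len):
    """Map each tokenized headline to its word vectors and zero-pad
    every headline to the corpus maximum length, producing F.

    `embed` is a token -> vector lookup standing in for the
    GloVe/fastText word-vector tables.
    """
    batch = np.zeros((len(headlines), max_len, dim))
    for i, tokens in enumerate(headlines):
        for j, tok in enumerate(tokens[:max_len]):
            batch[i, j] = embed.get(tok, np.zeros(dim))  # OOV -> zeros (assumption)
    return batch

# Toy lookup table with d = 4 dimensional vectors.
embed = {"budget": np.ones(4), "cuts": np.full(4, 2.0)}
F = build_feature_matrix([["budget", "cuts"], ["budget"]], embed, dim=4, max_len=3)
```

The shorter headline is padded with zero rows so every row of F has the same shape.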

BLSTM Layer
The LSTM is considered one of the most successful RNN variants; it introduces three additional gates. In the text mining domain, LSTMs have been applied to sentiment analysis [19], sentence classification [20], and other tasks. When applied to whole long documents, the training of LSTMs is unstable and underperforms traditional linear predictors, as shown in [21]; training and testing LSTMs on long documents is also time- and resource-consuming. [21] illustrates that LSTM training can be stabilized by pre-training the LSTMs as sequence autoencoders or recurrent language models. These problems are avoided, however, when LSTMs are used for label-sequence prediction, where the sequence is typically much shorter than a document; the label sequence here is the assignment of ordered labels to a text. Although several variants of the LSTM exist, the standard LSTM is used, with an additional word-embedding layer applied to the labels. The LSTM is an RNN that contains special units, called memory blocks, in its recurrent hidden layer [22]. Each memory cell has an input, output, and forget gate, and the hidden layer of the LSTM is also called the LSTM cell. The LSTM can capture long-term dependencies by equipping each memory cell with a set of gates of dimension d, where d is the memory dimension of the hidden state of the LSTM [23].
When words are represented using the word embeddings E, each word in the news headline F is independent of the others. In this layer, a new representation of each word is obtained by summarizing contextual information from both directions of the headline. The bidirectional LSTM is a combination of a forward LSTM (2), which reads the headline from x_1 to x_n, and a backward LSTM (3), which reads it from x_n to x_1. We concatenate the forward hidden state and the backward hidden state to obtain the hidden-state representation h_t for each word x_t, calculated as in (4) [24], where h_i is the output for the i-th word and the function ⊙ is a concatenation function used to combine the two outputs. Generally, different merge modes can be used to combine the outcomes of the Bi-LSTM layers: concatenation (the default), multiplication, average, and sum. The forward layer's output sequence is calculated iteratively over the inputs in positive order, while the backward layer's output sequence is calculated over the reversed inputs. This process captures information from the whole sentence around every word x_t. We denote the hidden states of all words x_t as H ∈ ℝ^(N×2p), where p is the size of each of the forward and backward hidden states.
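The forward/backward passes and the concatenation merge described above can be sketched with a bare-bones numpy LSTM. This is an illustrative assumption, not the paper's implementation: weights are random, biases zero, and gates are packed as [input | forget | output | candidate].

```python
import numpy as np

def lstm_pass(xs, W, U, b, p):
    """Run a single-direction LSTM over sequence xs and return the
    hidden state at every step (gate weights packed as [i|f|o|g])."""
    h, c = np.zeros(p), np.zeros(p)
    states = []
    for x in xs:
        z = W @ x + U @ h + b
        i, f, o = (1.0 / (1.0 + np.exp(-z[k * p:(k + 1) * p])) for k in range(3))
        g = np.tanh(z[3 * p:])
        c = f * c + i * g          # update the memory cell
        h = o * np.tanh(c)         # gated hidden state
        states.append(h)
    return states

def blstm(xs, params_f, params_b, p):
    """Concatenate forward and backward hidden states, giving one
    2p-dimensional vector h_t per word, as described in the text."""
    fwd = lstm_pass(xs, *params_f, p)               # reads x_1 .. x_n
    bwd = lstm_pass(xs[::-1], *params_b, p)[::-1]   # reads x_n .. x_1
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

rng = np.random.default_rng(0)
d, p, n = 4, 3, 5  # toy embedding dim, hidden size, sequence length
make = lambda: (rng.normal(size=(4 * p, d)), rng.normal(size=(4 * p, p)), np.zeros(4 * p))
H = blstm([rng.normal(size=d) for _ in range(n)], make(), make(), p)
```

Each of the n words ends up with a 2p-dimensional representation, matching H ∈ ℝ^(N×2p).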

Attention Layer
In text analysis tasks, the attention model represents the correlation between the words of a sentence and the output result; the model was first applied to machine translation. The feed-forward attention model [25] adopted in this paper is a direct simplification of the conventional attention model: it constructs a single vector c from the whole sequence by computing e_t = a(h_t), α_t = softmax(e_t), and c = Σ_t α_t h_t, where a is a learned scoring function that is now determined only by h_t. In these formulas, the attention mechanism can be viewed as constructing a fixed-length embedding c of the input sequence by calculating an adaptive weighted average of the sequence of states h. We obtain the final sentence representation used for classification from h* = tanh(c).
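The feed-forward attention step above can be sketched in numpy. The concrete form of the scoring function a is an assumption here (a tanh projection, consistent with common feed-forward attention formulations); the paper only states that it depends on h_t alone.

```python
import numpy as np

def feed_forward_attention(H, w, b):
    """Feed-forward attention over BLSTM states H (n x 2p):
    e_t = a(h_t), alpha = softmax(e), c = sum_t alpha_t * h_t,
    h* = tanh(c). The scorer a(h) = tanh(h . w + b) is an
    illustrative assumption."""
    e = np.tanh(H @ w + b)          # one scalar score per time step
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()            # softmax attention weights
    c = alpha @ H                   # adaptive weighted average of states
    return np.tanh(c), alpha

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 6))         # n = 5 words, 2p = 6 state size
h_star, alpha = feed_forward_attention(H, rng.normal(size=6), 0.0)
```

The weights α sum to one, so c stays in the span of the hidden states regardless of sequence length.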

Max Pooling
The output from the previous layer then undergoes 1-dimensional max pooling. This layer converts each kernel-sized window of the input into a single output by selecting the maximum value observed in that window. Pooling is used to reduce overfitting; this allows more layers to be added to the proposed architecture and, in turn, allows the neural network to extract higher-level features.
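The pooling operation above can be sketched in a few lines. Non-overlapping windows (stride equal to the pool size) are assumed, since the paper does not state the stride.

```python
import numpy as np

def max_pool_1d(feature_map, pool_size):
    """Reduce each non-overlapping window of `pool_size` steps to its
    maximum value, as described above (stride = pool_size assumed)."""
    n = len(feature_map) // pool_size
    trimmed = np.asarray(feature_map[: n * pool_size])
    return trimmed.reshape(n, pool_size).max(axis=1)

pooled = max_pool_1d([1.0, 3.0, 2.0, 5.0, 4.0, 0.0], pool_size=2)
# keeps the strongest activation in each window: [3.0, 5.0, 4.0]
```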

ReLu Layer
The activation, or ReLU, layer [26] handles the non-linearity in the model. It generates a rectified feature map, which is fed to the concatenation layer to combine the two matrices from the GloVe and fastText channels.

Concatenated Layer
In the concatenated layer, the outputs of the two ReLU layers, one per word embedding, are combined to form a larger matrix S_all, where f_g is the vector produced by the ReLU process on the GloVe channel and f_s is the vector produced by the ReLU process on the fastText channel. S_all is then fed to the fully connected layer.
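The rectification and channel merge above reduce to a short sketch; this is illustrative only, with toy vectors standing in for the GloVe-channel output f_g and the fastText-channel output f_s.

```python
import numpy as np

def relu(x):
    # Rectified linear activation: negative values clamped to zero.
    return np.maximum(x, 0.0)

def concatenate_channels(f_g, f_s):
    """Rectify each channel's feature map and join them into the
    larger matrix S_all = [ReLU(f_g) ; ReLU(f_s)]."""
    return np.concatenate([relu(f_g), relu(f_s)])

S_all = concatenate_channels(np.array([-1.0, 2.0]), np.array([3.0, -4.0]))
# -> [0.0, 2.0, 3.0, 0.0]
```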

Representation Layer
The output layer is a fully connected layer with a softmax activation function. The concatenated feature map is input to the fully connected softmax layer, which calculates the probability of each output class and classifies the news headline as sarcastic or non-sarcastic. The output vector of the softmax layer P_i is given by (10),
where P_i is the probability that the input is sarcastic or not, W_c is the weight matrix, and b_c is the offset (bias) value.
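The softmax classification step above, P = softmax(W_c · S_all + b_c), can be sketched as follows. The weights here are arbitrary illustrative values, not trained parameters.

```python
import numpy as np

def classify(s_all, W_c, b_c):
    """Fully connected softmax layer: P = softmax(W_c @ s_all + b_c),
    returning the sarcastic / non-sarcastic class probabilities."""
    z = W_c @ s_all + b_c
    p = np.exp(z - z.max())   # subtract max for numerical stability
    return p / p.sum()

# Toy 3-dimensional feature vector and a 2-class weight matrix.
P = classify(np.array([1.0, -2.0, 0.5]),
             np.array([[0.2, 0.1, -0.3], [-0.2, 0.4, 0.6]]),
             np.array([0.0, 0.1]))
# P sums to 1; argmax gives the predicted class
```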

Results and discussion
To discuss the results, the empirical analysis is divided into two parts: (i) parameter settings for the proposed model and (ii) comparison with multiple baselines on the basis of classification accuracy.

Parameter Setting
Optimal selection of parameters is imperative to achieve the best performance. We use the validation data to tune the hyper-parameters so as to obtain the best results. Table 1 lists the values used in this work.

Performance Result
The proposed model was evaluated for predicting sarcasm in news headlines using one dataset containing a total of 56,418 headlines. The results were assessed using the key performance indicators accuracy, recall, precision, and F-measure [27, 28]. Table 2 lists the results of the proposed MCAB-BLSTM model on this dataset.

Comparison With Other Deep Learning Models Result
We compare the results of the proposed model with two other deep learning architectures, namely CNN-LSTM and the Hybrid Neural Network. Word embedding was performed using GloVe for each baseline model, and the evaluation used the four key performance indicators. The results obtained for CNN-LSTM and the Hybrid Neural Network are shown in Tables 3 and 4, respectively. The proposed MCAB-BLSTM clearly outperforms the other models, achieving an accuracy of 96.64% on the news headline dataset, while CNN-LSTM shows the lowest accuracy at 86.16%. In order from lowest to highest accuracy, the models rank CNN-LSTM < Hybrid Neural Network < MCAB-BLSTM. The proposed MCAB-BLSTM also achieves the best recall on the news headline dataset, with a value of 97%, as well as the best precision, also 97%. Table 5 summarizes the accuracy comparison of the three models, and Fig. 3 illustrates these comparative results graphically.

Conclusions
This paper proposes a sarcasm detection model, the Multi-Channel Attention Based BLSTM (MCAB-BLSTM), based on the characteristics of text sentiment analysis. First, two word embeddings are constructed and fed into the attention-based BLSTM; the results then pass through a ReLU layer, and the two output matrices are concatenated for training, which effectively addresses the long-term dependence and gradient dispersion problems that other deep learning models face during training. Experiments show that this method significantly improves sarcasm detection, reaching an accuracy of 96.64%. In future work, we will explore the influence of different neural network architectures on the model, introduce additional attention structures, and further optimize the sarcasm detection model.

Funding
There is no funding.

Availability of data and materials
The dataset is available from Kaggle.