Deep Learning Approach for Classifying the Aggressive Comments on Social Media: Machine Translated Data Vs Real Life Data

Aggressive comments on social media negatively impact human life. Such offensive content contributes to depression and suicide-related behavior. As online social networking grows day by day, hateful content grows with it. Several investigations have been conducted on cyberbullying, cyberaggression, hate speech, and related topics, but most of them target the English language. Some languages, such as Hindi and Bangla, still lack proper investigation because datasets are scarce. This paper works on Hindi, Bangla, and English datasets to detect aggressive comments and presents a novel way of generating machine-translated data to address data unavailability. A fully machine-translated English dataset has been analyzed with models such as Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), LSTM-Autoencoder, word2vec, Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformer 2 (GPT-2) to observe how the models perform on a noisy machine-translated dataset. We have compared the performance on this noisy data with two further datasets: raw data, which contains no noise, and semi-noisy data, which contains a certain amount of noise. We have classified both the raw and semi-noisy data using the aforementioned models. To evaluate the models, we have used F1-score, accuracy, precision, and recall. We have achieved the highest accuracy on raw data using the GPT-2 model, on semi-noisy data using the BERT model, and on fully machine-translated data using the BERT model. Since many languages lack adequate data, our approach will help researchers create machine-translated datasets for a variety of analyses.


I. INTRODUCTION
The invention of the World Wide Web (WWW) had a significant influence on social media, as users can share a variety of content, including instructive, entertaining, and personal information, very quickly, without being physically present and using only a digital device [1]. Facebook, Twitter, Instagram, and YouTube are the most widely used social media networks [2]. These platforms enable users to share their ideas, knowledge, and points of view. However, these platforms also have a negative side [3]. In some cases, when not well used, the freedom of digital social media has detrimental effects, including despair, depression, and even suicide [4,5]. As a result, social media is becoming riskier for users and may even encourage some to end their lives. Therefore, hate speech has been the subject of numerous investigations to determine the causes of online hostility and the actions to take against it.
Cyberbullying is the repeated harming of an individual or a group of individuals through the dissemination of offensive content or other types of social violence using digital media [6][7][8]. Teenagers are the group that most often experiences cyberbullying on social media [9]. One study has shown that 36.5% of students have dealt with cyberbullying at least once in their lives, and that among all forms of online comments, rude or cruel remarks were the most prevalent. Another study found that out of 1,501 adolescents in the USA between the ages of 10 and 17, 12% admitted to abusing someone online, 4% admitted to being victims, and 3% admitted to being both the aggressor and the victim of cyberbullying [10]. A survey by Sri Lanka's law enforcement agency, the Cyber Crimes Division (CID), reveals that more than 1000 incidents of cyberbullying were recorded there. Over 90% of university students reported having experienced cyberbullying, and almost all poll respondents stated that they knew someone who had been bullied online. 80% of the cyberbullying experienced by Sri Lankans occurred on Facebook. 65% of college students had posted inconvenient videos or photographs online, 15% of users had posted personal information online, 9% had disseminated lies and inaccurate information about others, and 2% had posted offensive material [11]. Cyberbullying spreads quickly and easily, reaches a broad audience, and remains visible for an extended period, which is a big problem nowadays [12]. It becomes an everyday occurrence, and victims face it repeatedly, which causes both mental and physical issues [13]. Schneider et al. [14] showed a relationship between victimization and five categories of physical distress in a MetroWest Adolescent Health survey that collected information from over 20,000 pupils. Among the cyberbullying victims, depressive symptoms (34%) and self-harm (24%) were the most frequent forms of psychological distress.
Since the number of users grows daily, a thorough investigation is now required to address the problem of cyberbullying. Previous investigations in this field fall short in several respects, such as accurate detection with effective algorithms and the availability of data for training advanced AI models, and these shortcomings need to be addressed as soon as possible [15,16]. Previously, Perera et al. [11] conducted an investigation to detect cyberbullying on social media. However, they used a small dataset (1000 labeled texts) with classifiers such as the Support Vector Machine (SVM), which resulted in a low accuracy (74%). Later, Alotaibi et al. [17] proposed a multichannel deep learning framework for cyberbullying detection on social media using a Twitter dataset of 55,788 instances. Their multichannel deep learning model did not provide satisfactory results (88% accuracy). Finally, Ahmed et al. [18] used a deep neural network to detect cyberbullying on social media using 44001 comments from Facebook. They also tried to develop a hybrid neural network, but the result was ineffective (85% accuracy).
However, countries like Bangladesh and India lack proper investigations because data is scarce, and no investigation has tried to resolve this data unavailability. Therefore, those countries are more vulnerable to cyberbullying and cyberaggression due to the lack of research and appropriate action. We have focused on Hindi, Bangla, and English to detect aggressive comments on social media platforms. We have used the TRAC-2 dataset, which contains English, Bangla, and Hindi comments.
Furthermore, we have tried to resolve the data unavailability issue by creating a fully machine-translated English dataset. A dataset is essential for training a model to detect aggressive comments, so if a language lacks such a dataset, it may become impossible to solve the aggressive comment problem for that language. Google Translate can translate data from one language to another, but the output contains a lot of noise that may not be appropriate for training a deep learning model. In this paper, we have shown how deep learning models perform on machine-translated noisy datasets and compared the results with those on the raw and semi-noisy datasets, which gives a clear picture of how the models perform on noise-free, semi-noisy, and fully noisy data. The raw dataset is imbalanced. We have used a machine-translation augmentation process to remove the imbalance and constructed a semi-noisy dataset. After constructing the semi-noisy and fully noisy datasets, we extracted features using the BERT embedding model. The extracted features have been fed to deep learning models such as LSTM, BiLSTM, LSTM-Autoencoder, Word2vec, BERT, and GPT-2 for classifying the aggressive comments. We have checked the performance of the models using F1-score, precision, recall, and accuracy.
All of the evaluations have been done on the unseen dataset. We have obtained the highest accuracy of 80 percent on English raw data and 73 percent on Bangla raw data using the GPT-2 model. For the semi-noisy dataset, we have obtained the highest accuracy of 75 percent on English, 71 percent on Bangla, and 68 percent on the Hindi dataset using the BERT model. For the fully machine-translated English dataset, we have achieved the highest accuracy of 78 percent using the BERT model. The main contributions of this paper can be summarized as follows.
(1) A process for generating synthetic data has been shown. We have also shown the performance of the LSTM, BiLSTM, LSTM-Autoencoder, Word2vec, BERT, and GPT-2 models on synthetic data. We call the synthetic data "noisy" because it is machine translated and contains a lot of noise.
(2) Extensive experiments on three kinds of datasets (noisy, semi-noisy, and noise-free) using the LSTM, BiLSTM, LSTM-Autoencoder, Word2vec, BERT, and GPT-2 models have been performed, and the models have been compared using F1-score, precision, recall, and accuracy. The semi-noisy data refers to the combination of noisy data and raw data. This process is essential for comparing different neural network models at different levels of noise in the dataset. Following our approach, suicidal activities may be reduced to some extent by detecting aggressive comments and taking action based on the prediction. Our approach can be useful for languages that lack data availability.
The rest of this paper is arranged as follows. Section II provides the background needed for the study. The data sources, preprocessing methods, and models used in this work for aggression detection tasks are discussed in Section III. The simulation results based on the classification algorithms and the comparison using the derived results are analyzed in Section IV. Finally, this paper is summarized in Section V.

II. RELATED WORK
Deep learning models such as Word2vec, LSTM, BiLSTM, BERT, XLM-RoBERTa, and FastText are popular for dealing with textual data. Some models capture the true meaning of a sentence very well; others require a lower computational cost. Many of these models have been used for cyberbullying detection. Therefore, we have reviewed both very recent and earlier investigations, particularly in the cyberbullying field.
Previously, Perera et al. [11] presented an approach for accurate cyberbullying detection and prevention on social media using 1000 manually labeled texts from Twitter. They labeled the dataset manually because some comments contain slang words but are still non-bullying comments, e.g. "you have done fucking well in the exam". They tried to understand the true meaning of each comment and annotated it accordingly. They used Support Vector Machines (SVM) for classification and logistic regression to select the best combination of features. Their proposed solution provides 74% accuracy.
Simon et al. [19] presented a systematic review of machine learning trends in the automatic detection of hate speech on social media platforms. A total of 31,714 articles from 2015 to 2020 were examined; 41 papers were included based on the inclusion criteria, while 31,673 papers were excluded according to the exclusion criteria. This study concluded that machine learning and deep learning are the most successful methods for classifying hate speech on social media. Moreover, they found that many researchers used the support vector machine algorithm for classification, while deep learning models are steadily gaining popularity. A similar systematic review was presented by Castaño-Pulgarín et al. [20]. They found 67 studies eligible for analysis out of 2389 papers in the online search. They conducted a qualitative study but did not analyze technical approaches. The results showed that the victims are mainly from Muslim countries and that abusers target the Muslim religion in their hate speech.
Roy et al. [21] used Multilingual Transformers for hate speech detection. They examined the issue of identifying offensive and hateful words on Twitter. They specifically tried to address two classification issues. First, categorize each tweet as either hostile and insulting (class HOF) or not (class NOT). Second, categorize it into one of the following three categories: hate speech, offensive, and profanities (HATE, OFFN, PRFN). They used the XLM-Roberta classification model. They achieved Macro F1 scores of 90.29, 81.87, and 75.40 for English, German, and Hindi, respectively, while performing hate speech detection and 60.70, 53.28, and 49.74 during fine-grained classification.
Alotaibi et al. [17] proposed a multichannel deep learning framework for detecting cyberbullying on social media. This method divides Twitter comments into aggressive and non-aggressive categories. They classified the comments using a transformer, a bidirectional gated recurrent unit, and a convolutional neural network. The effectiveness of the suggested strategy was evaluated using a combination of three well-known hate speech datasets. The proposed approach had an accuracy of about 88%.
Sadiq et al. [22] used a deep neural model to detect aggressive comments on Twitter. They used a Multilayer Perceptron and fed manually engineered features into it. They also experimented with CNN-LSTM and CNN-BiLSTM deep neural network combinations; both of them worked well. Their statistical findings demonstrated that the proposed model worked best, detecting aggressive behavior with 92% accuracy.
Ahmed et al. [18] used a deep neural network to detect cyberbullying on social media. The dataset they used comprises 44001 user comments from Facebook pages. They grouped the data into the categories religious, troll, threat, non-bully, and sexual, and preprocessed the text to remove errors such as incorrect punctuation and flawed characters before feeding it into the neural network. The preprocessing was carried out in three stages: removing stop words, tokenizing the strings, and converting them to padded sequences. Their model consists of three parts: 1. identifying harassment-related comments, which contain descriptors such as threat, troll, and religious, as bully; 2. using a hybrid classification model for categorizing all five classes; 3. using an ensemble approach to increase accuracy by pooling the predicted results from the multiclass classification models. The model provides 85% accuracy, while their binary classification model provides 87.91% accuracy.
Kumar and Sachdeva [23] showed a Bi-GRU with attention and CapsNet hybrid model for cyberbullying detection on social media. They demonstrated their proposed model's result and showed that for MySpace and Formspring, the F-score has improved by almost 9% and 3%, respectively.
Alam et al. [24] presented an ensemble-based machine learning approach for detecting cyberbullying. They developed both single and double-voting models to classify offensive and non-offensive comments. The dataset was collected from Twitter. To compare their results, they used four machine learning models, three ensemble models, and two feature extraction methods. Moreover, they coupled various n-gram analyzers with those models. The results showed that their proposed SLE and DLE voting classifiers performed best among all the aforementioned models. The best performance of their suggested SLE and DLE models was 96%, obtained when TF-IDF (unigram) feature extraction was used with K-Fold cross-validation.
Desai et al. [25] used a machine learning approach to detect cyberbullying on social media. They proposed a model based on certain characteristics that should be considered when identifying cyberbullying and applied a few of these characteristics with the aid of a bidirectional deep learning model known as BERT. They split the features into sentimental, syntactic, sarcastic, semantic, and social. Their suggested model reached 91.90% accuracy, a better result than the typical machine learning models employed on comparable datasets.

A. Dataset Specification
The dataset used in this work is collected from TRAC-2 (Workshop on Trolling, Aggression and Cyberbullying), which contains 25,000 comments from three social media platforms (Facebook, YouTube, and Twitter) in three languages (English, Bengali, and Hindi) [26]. The shared task has two sub-tasks: Sub-Task A (aggressive comments) and Sub-Task B (misogynistic comments). Sub-Task A comprises three classes: Non-Aggressive (NAG), Overtly Aggressive (OAG), and Covertly Aggressive (CAG). Indirectly aggressive comments are annotated as Covertly Aggressive (CAG), directly aggressive comments are annotated as Overtly Aggressive (OAG), and comments without aggression are annotated as Non-Aggressive (NAG). Likewise, Sub-Task B contains two classes: GEN and NGEN. A comment that refers to a man, woman, or transgender person is annotated as GEN, and a comment that does not refer to gender is annotated as NGEN. All three datasets contain both train and test sets. In our project, we experimented with Sub-Task A, since its classes fully align with our prediction purpose, which is cyberbullying detection. The statistics of the dataset provided by the organizers are shown in Table 1 for Sub-Task A. Some examples of the text data are shown in Figure 1.

B. Data Preprocessing
To create semi-noisy data, we have added noisy data to the raw data until the class imbalance in the dataset is resolved. The data we have used is initially highly imbalanced for Sub-Task A, and models trained on imbalanced data perform poorly at predicting aggressive comments. The category NAG holds 50 percent of the total text data, and the categories OAG and CAG hold the other 50 percent. To resolve the imbalance, we have augmented the text data so that all of the classes hold almost the same amount of text data. We have adopted two methods for the augmentation process: Noise Addition and Data Translation.
Noise Addition: Noise is added to the raw text data by replacing a word with a synonym or antonym, adding random stop words, and shuffling some words. The process increases the corpus size while keeping the context the same as in the raw dataset.
Data Translation: Data have been translated from one language to another, e.g. English to Bangla, using Google Translate. All the languages have the same sub-tasks and classes, so we have translated the texts in all possible language combinations until the dataset becomes balanced. We translated the texts of the NAG, OAG, and CAG classes from Sub-Task A.
We have added the texts from the noise augmentation process and the texts from the translation augmentation process to the raw data so that the new corpus holds an almost equal number of texts per class for Sub-Task A. The statistics of the dataset after adding the augmented data to the raw data are shown in Table 2 for Sub-Task A.
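The noise-addition and balancing steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the stop-word list, the toy synonym lexicon, and the function names `add_noise` and `balance` are hypothetical, antonym replacement and translation-based augmentation are left out, and balancing is done by simply topping up every class to the size of the largest one.

```python
import random

# Hypothetical resources for illustration only; a real pipeline would use
# a proper stop-word list and a thesaurus for the target language.
STOP_WORDS = ["the", "a", "is", "of", "to"]
SYNONYMS = {"bad": ["poor", "awful"], "good": ["fine", "nice"]}

def add_noise(text, rng):
    """Return a noisy copy of text: synonym swap, stop-word insertion, local shuffle."""
    words = text.split()
    # 1) replace every word that has an entry in the toy lexicon
    words = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    # 2) insert one random stop word at a random position
    words.insert(rng.randrange(len(words) + 1), rng.choice(STOP_WORDS))
    # 3) swap one pair of adjacent words
    if len(words) > 1:
        k = rng.randrange(len(words) - 1)
        words[k], words[k + 1] = words[k + 1], words[k]
    return " ".join(words)

def balance(samples, rng):
    """Add noisy copies to smaller classes until all classes match the largest one."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = list(samples)
    for label, texts in by_label.items():
        for _ in range(target - len(texts)):
            balanced.append((add_noise(rng.choice(texts), rng), label))
    return balanced

rng = random.Random(0)
data = [("you are good", "NAG"), ("nice video", "NAG"),
        ("this is bad", "OAG"), ("go away", "CAG")]
balanced = balance(data, rng)

counts = {}
for _, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

The original samples are kept unchanged and only the added samples are noisy, which mirrors the idea of a semi-noisy corpus that mixes raw and augmented text.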

C. Fully Machine Translated Data
Using the translation augmentation process, we have generated a complete machine-translated English dataset, which is fully noisy. We have translated the Bangla and Hindi Sub-Task A texts into English to create the new dataset. The statistics of the fully translated data are shown in Table 3 for Sub-Task A. Some examples of the machine-translated English data are shown in Figure 2.

D. Input Representation
The raw, semi-noisy, and fully noisy datasets are converted into a numerical representation that a machine can process. A computer can only work with numbers, so it is necessary to convert the text into numbers before feeding it into the models. We have used a BERT tokenizer for the BERT models and the TensorFlow Keras tokenizer for the Autoencoder, LSTM, and BiLSTM models.
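To make the text-to-number conversion concrete, the sketch below builds a word index and turns each comment into a fixed-length list of integers. It is a simplified stand-in written in plain Python, not the Keras or BERT tokenizer itself: the function names `build_vocab` and `encode`, the choice to pad at the end, and the choice to skip unknown words are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer id (1 = most frequent); 0 is reserved for padding."""
    freq = Counter(w for t in texts for w in t.lower().split())
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: k for k, w in enumerate(ordered, start=1)}

def encode(text, vocab, max_len):
    """Convert text to a list of exactly max_len ids; unknown words are skipped."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    ids = ids[:max_len]                        # truncate long sequences
    return ids + [0] * (max_len - len(ids))    # pad short sequences with 0

texts = ["you are so kind", "you are so rude"]
vocab = build_vocab(texts)
encoded = [encode(t, vocab, max_len=6) for t in texts]
print(encoded)
```

Fixed-length integer sequences of this kind are what an embedding layer in front of an LSTM or BiLSTM typically consumes.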

E. Classification Models
The LSTM, BiLSTM, LSTM-Autoencoder, Word2vec, BERT, and GPT-2 models have been applied to the text data of the TRAC-2 workshop. All of the models are applied individually to the English, Bangla, and Hindi datasets. For training the models, the dataset has been split into two parts: training [...]
1) LSTM: [...] layer [27,28]. Figure 3 depicts the LSTM's structural layout. The elementary state of the RNN architecture is given by the function:
$$h_t = f(h_{t-1}, x_t; \theta) \tag{1}$$
Here, $\theta$ denotes the function parameters, $h_t$ denotes the current hidden state, $x_t$ denotes the current input, and $f$ denotes the function applied to the previous hidden state $h_{t-1}$.
On very large datasets, the weakness of the RNN architecture is its tendency to forget data items, whether they are necessary or not. Because of the nature of time-series data, there is a long-term dependency between the current data and the preceding data. The LSTM model was developed specifically to address this difficulty. It was first proposed by Hochreiter and Schmidhuber [29]. This model's main contribution is its ability to retain long-term dependencies by erasing redundant data and remembering crucial data at each update step of gradient descent [30]. The LSTM architecture contains four parts: a cell, an input gate, an output gate, and a forget gate [31]. The purpose of the forget gate is to eliminate extraneous data by determining, from the previous hidden state $h_{t-1}$ and the current input $x_t$, which data in the previous cell state $C_{t-1}$ should be discarded. At each cell state, the sigmoid function of the forget gate outputs values between 0 and 1: values near 1 retain information deemed necessary, and values near 0 eliminate information deemed superfluous [29,32,33]. The forget gate's equation is:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{2}$$
where $f_t$ denotes the forget gate output, $\sigma$ denotes the sigmoid activation function, $h_{t-1}$ denotes the output of the previous hidden state, $W_f$ denotes the weights of the forget gate, $b_f$ denotes the bias of the forget gate, and $x_t$ denotes the current input.
After the unneeded values are erased, new values are written to the cell state. The procedure has three steps. First, a sigmoid layer called the "input gate layer" decides which values need to be updated:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{3}$$
Second, a tanh layer creates a vector of new candidate values:
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{4}$$
Third, the sum of $\tilde{C}_t * i_t$ and $C_{t-1} * f_t$ gives the new cell state $C_t$. The updated state's equation is:
$$C_t = C_{t-1} * f_t + \tilde{C}_t * i_t \tag{5}$$
To determine which part of the output is kept, the output is finally filtered using the sigmoid and tanh functions:
$$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{6}$$
$$h_t = O_t * \tanh(C_t) \tag{7}$$
Here, $h_t$ is the output that is used as the input of the next hidden layer.
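The gate computations described above can be sketched as a single LSTM time step in plain Python. This is a minimal illustration assuming the standard LSTM formulation (forget gate, input gate, candidate values, cell update, output gate); the helper names `affine` and `lstm_step` and the layout of the `params` dictionary are hypothetical, and a real model would use a deep learning library with learned weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W·v + b for a weight matrix W (list of rows) and a bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, b) pair."""
    z = h_prev + x_t                                          # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]       # forget gate
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]       # input gate
    c_hat = [math.tanh(a) for a in affine(*params["c"], z)]   # candidate values
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]       # output gate
    # New cell state: keep part of the old state, add part of the candidates.
    c_t = [c * f + g * i for c, f, g, i in zip(c_prev, f_t, c_hat, i_t)]
    # New hidden state: output gate applied to the squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy example: hidden size 1, input size 1, all weights and biases set to zero.
zero = ([[0.0, 0.0]], [0.0])
params = {"f": zero, "i": zero, "c": zero, "o": zero}
h_t, c_t = lstm_step([1.0], [0.0], [2.0], params)
print(h_t, c_t)
```

With all-zero weights every gate evaluates to 0.5 and the candidate to 0, so the step simply halves the previous cell state, which makes the role of the forget gate easy to see.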
2) BiLSTM: The Bidirectional Long Short-Term Memory (BiLSTM) was first proposed by Graves [34]. The BiLSTM architecture can learn patterns in both the past-to-future and the future-to-past direction. This sets it apart from the LSTM model, which learns patterns only from the past to the future. Figure 4 depicts the bidirectional LSTM's structural layout. The backward propagation layer functions primarily as a reversed copy of the forward LSTM layer. The hidden layer synthesizes information from both the forward and backward directions [35]. As a result, "the reverse direction of forwarding direction" is used to calculate the LSTM reverse layer. Following [35], the network computes the forward layer's output, denoted $h_f$, and the reverse layer's output, denoted $h_b$, and the output of the hidden layer is derived from both.
3) LSTM-Autoencoder: An LSTM-Autoencoder is an autoencoder for sequential data that follows an encoder-decoder LSTM architecture. For a given sequential dataset, the LSTM-Autoencoder is designed to read the input sequence, encode it, decode it, and reconstruct it. The model's performance is estimated by how accurately it can reconstruct the sequence. The LSTM-Autoencoder can be used on video, text, audio, and time-series sequence data [36].
4) Word2vec: Word2vec is a word embedding model for textual data. Raw words are complex objects that machine learning algorithms cannot process directly. Word embedding therefore represents the words in a way that preserves each word's meaning. The model maps each word to a vector of real numbers using a neural network and is capable of capturing a large number of semantic and syntactic relationships. Word2vec is built with a two-layer neural network. It can detect synonymous words and suggest additional words for partial sentences [37]. Trained on a corpus of text, such models build semantic vector spaces, which frequently have several hundred dimensions. Each word in the corpus is represented as a vector in this space, and words that share similar contexts are located close to one another. Two methods can serve as the foundation of the word2vec architecture: a continuous bag-of-words (CBOW) model or a continuous skip-gram model. The former uses the context to predict the current word, while the latter uses the current word to predict the surrounding words. Both models have a low computational complexity, making it possible for them to quickly process a corpus with billions of words. Although CBOW models are quicker, skip-gram has been found to perform better on small datasets. Therefore, we decided to use the latter model.
5) BERT: Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained model that has been trained on a sizable unlabeled corpus of Wikipedia and book text. It captures a deeper sense of language context because it learns from text in both directions, left to right and right to left, rather than in one single direction [38]. A pre-trained BERT model can be fine-tuned with just one additional layer to achieve state-of-the-art results in a variety of NLP tasks. There are two variants of BERT: BERT-Base and BERT-Large. This paper has used two types of BERT models: BERT base and multilingual BERT. The BERT base model is trained on English text data, which leaves low-resource languages such as Bangla and Hindi behind. The multilingual BERT model, in contrast, was trained on Wikipedia content with a vocabulary shared across all languages and supports 104 languages. The Bangla and Hindi datasets have been classified using the multilingual BERT model [39,40].
Input Format: BERT requires a specific format for the input token sequence. Every sequence should begin with a [CLS] (classification) token, and each sentence should be followed by a [SEP] (separator) token. The output embedding that corresponds to the [CLS] token is the sequence embedding used to classify the entire sequence.
6) GPT-2: Generative Pre-trained Transformer 2 (GPT-2) is a state-of-the-art generative transformer model. It has been trained on a large web text corpus. GPT-2 is mostly used for next-sequence prediction, question answering, sequence classification, and abstractive or text summarization. GPT-2 is known as a transformer decoder because it is not built from encoders; instead, it relies on decoders as its main structural framework. GPT-2 has several variants: GPT-2 Small, GPT-2 Medium, GPT-2 Large, and GPT-2 Extra Large. We have used GPT-2 Medium for aggression classification [41].

F. Evaluation metrics
Evaluating a model's performance is necessary since it shows how close the model's predicted outputs are to the corresponding expected outputs. Evaluation metrics are used for this purpose, and they differ with the type of model: classification or regression. Regression refers to problems that involve predicting a numeric value, and classification refers to problems that involve predicting a discrete value. Regression problems use error metrics to evaluate the models, whereas classification problems use accuracy-based metrics. Since our goal is to classify aggressive comments, we used accuracy and F1 score as the main evaluation metrics [42].
Precision: Precision specifies how many of the model's positive predictions are correct. It is the relevant metric when the cost of false positives is high. In aggression classification, a model with low precision flags many non-aggressive comments as aggressive, whereas a model with high precision raises few such false alarms. The precision can be calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$
Here and in the following formulas, TP refers to True Positive, FP refers to False Positive, TN refers to True Negative, and FN refers to False Negative.
Recall: Recall complements precision. It is the relevant metric when the cost of false negatives is high. In aggression classification, a model with low recall labels many aggressive comments as non-aggressive, whereas a model with high recall misses few aggressive comments. The recall can be calculated as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1 score: The F1 score combines precision and recall and provides an overall measure of the model's accuracy. The value of the F1 score lies between 0 and 1. If every predicted value matches the expected value, the F1 score is 1, and if none of the predicted values matches the expected value, it is 0. The F1 score can be calculated as follows:
$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
Accuracy: Accuracy determines how close the predicted output is to the actual value. It can be calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

IV. RESULT AND DISCUSSION
Sub-Task A has been considered for classifying aggressive comments on social media. Since the dataset is imbalanced, several data preprocessing methods are adopted before classification. Data augmentation methods, namely machine translation and noise addition, are used to resolve the imbalance. Finally, a fully machine-translated English dataset has been created to check the performance of existing deep learning models on machine-translated data, which contains the most noise. We have trained six deep learning models, LSTM, BiLSTM, LSTM-Autoencoder, Word2vec, BERT, and GPT-2, using all datasets. The models are evaluated using F1 score, accuracy, precision, and recall. We derived the evaluation results on the unseen test dataset. The performance of the models is shown in Table 4, Table 5, and Table 6, which report the classification results for the Autoencoder, LSTM, BiLSTM, word2vec, and BERT models using raw data, raw data with augmented data, and machine-translated English data. We have observed that the GPT-2 model performed best on English and Bangla raw data, whereas the BERT model performed best on Hindi raw data. GPT-2 gave the highest accuracy of 80 percent on English raw data. However, the BERT model performed best on the augmented and machine-translated datasets. It gave 78 percent accuracy on the machine-translated dataset. From the experiment, we have found that the BERT model performed best on the noisy datasets, while the GPT-2 model performed best on the raw dataset, which does not contain any noise.
It is apparent that using a machine-translated dataset is quite risky, since the dataset contains noise and correcting it requires human intervention, which is costly and time-consuming. This work shows that the BERT model can work well on the noisy dataset. We evaluated the model on unseen raw data and obtained 78 percent accuracy, which is good enough for industrial use and future investigations. For the fully noisy dataset, the training and validation accuracy curves are shown in Figure 5. For all models, the training accuracy is higher than the validation accuracy, which we take as an indication that the models have learned the dataset properly without overfitting or underfitting. Figure 5 shows the model accuracy curve derived from the BERT model using the fully machine-translated data. The confusion matrix shows how many correct and incorrect predictions, such as true positives and false negatives, each model has made. Figure 6 shows the confusion matrix for the BERT model using the fully machine-translated data.
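As a rough illustration of how accuracy, precision, recall, and F1 score follow from a confusion matrix for the three Sub-Task A classes, consider the sketch below. The function name `per_class_metrics` is hypothetical and the counts in `cm` are made up for the example; they are not results from this paper.

```python
def per_class_metrics(cm, labels):
    """cm[r][c] = number of samples with true class r that were predicted as class c."""
    n = len(labels)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    report = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted as k, but another class
        fn = sum(cm[k]) - tp                        # class k, but predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, report

labels = ["NAG", "OAG", "CAG"]
cm = [[8, 1, 1],   # made-up counts for illustration only
      [2, 6, 2],
      [0, 3, 7]]
accuracy, report = per_class_metrics(cm, labels)
print(accuracy)
for name in labels:
    print(name, report[name])
```

In the multi-class case each class is treated in turn as the positive class, so precision, recall, and F1 are reported per class, while accuracy is a single number over all samples.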

V. CONCLUSION
In this work, we have presented a process for generating machine-translated data and shown how deep learning models perform on such a dataset. We have compared this with the performance of the models on the raw and semi-noisy datasets. The raw data denotes the dataset we collected from the organizers. The semi-noisy dataset denotes the raw data combined with the augmented data we added. We have used the LSTM, BiLSTM, LSTM-Autoencoder, word2vec, BERT, and GPT-2 models and evaluated them using accuracy, F1-score, precision, and recall. The performance metrics show that the BERT model performed best on the machine-translated and semi-noisy data, and the GPT-2 model performed best on the raw dataset. The difference between the accuracy on raw, semi-noisy, and machine-translated data is minimal. The highest accuracy is 80 percent for raw data, 78 percent for semi-noisy data, and 78 percent for machine-translated data. Training the BERT model on machine-translated data thus gives almost the same result as training on the raw dataset, which may be useful for languages that lack a large dataset. Using our approach, future researchers will be able to analyze various text-related problems that were left behind due to the unavailability of datasets.