This article offers techniques for identifying cyberbullying across a variety of social media platforms, including WhatsApp, Facebook, Instagram, TikTok, and YouTube.
The cyberbullying detection framework presented in the block diagram comprises two major components: Natural Language Processing (NLP) and Machine Learning (ML). Real-time data, extracted from platforms such as Twitter, WhatsApp, Facebook, Instagram, TikTok, and YouTube, undergoes a meticulous preprocessing phase to eliminate unnecessary characters. This involves tasks like removing hashtags, stopwords, numeric data, and hexadecimal patterns, as well as converting text to lowercase. The subsequent NLP techniques, including tokenization, lemmatization, and vectorization, prepare the data for ML algorithms. The machine learning pipeline involves data collection from diverse sources, text cleaning to remove noise, tokenization for analysis, and stopword removal to reduce dataset noise. Feature extraction incorporates TF-IDF vectorization and word embeddings for comprehensive representation. For multimodal analysis, image processing and feature concatenation combine textual and visual features. Model selection includes supervised learning models such as SVM and Logistic Regression, as well as deep learning architectures such as RNNs or Transformers. Ensemble methods enhance model robustness. Training involves dataset splitting, hyperparameter tuning, and validation [12].
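As a concrete illustration, the following is a minimal sketch of the vectorization-plus-classifier stage of such a pipeline, assuming scikit-learn; the example comments, labels, and hyperparameters are hypothetical placeholders, not the study's actual configuration.

```python
# Minimal sketch of a TF-IDF + supervised-classifier pipeline (assumed setup).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

comments = ["have a great day", "nobody likes you, loser"]  # placeholder data
labels = [0, -1]  # 0 = non-toxic, -1 = toxic, matching the labeling scheme below

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(comments, labels)
print(pipeline.predict(["you are such a loser"]))  # likely [-1] (toxic)
```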
Model evaluation employs metrics like accuracy, precision, recall, and the confusion matrix. Ethical considerations address bias mitigation and privacy concerns. Deployment involves real-time monitoring and scalability for handling large data volumes. A feedback mechanism allows users to contribute to model improvement, promoting continuous learning to adapt to evolving cyberbullying patterns. This integrated approach aims to create a safer online environment by effectively identifying and mitigating cyberbullying through a systematic and ethical machine learning-based detection system.
A. Dataset Description
Data collection is a foundational aspect of research, enabling the study of specific variables and facilitating predictions. In the context of cyberbullying, constructing representative models relies on a reliable dataset. This study employs its own diverse dataset, comprising 8,455 comments from various social media platforms, to examine instances of cyberbullying and harassment.
Notably, the dataset reveals a higher prevalence of cyberbullying directed at women (68.1%) compared to men (31.9%), with comments directed at actors, politicians, singers, and sports personalities.
The labeled dataset designates comments as toxic (-1), i.e., bullying, or non-toxic (0), i.e., non-bullying; 57.2% of the comments are classified as bullying sentences and 42.8% as non-bullying sentences, reflecting the diversity of negative language commonly used in daily life, as shown in Fig. 5.
The study underscores the significance of a representative dataset in understanding cyberbullying patterns, highlighting notable instances of bullying towards well-known individuals. Following data extraction, preprocessing becomes imperative to clean and prepare the dataset for effective detection. Real-world data often contains unnecessary characters, necessitating thorough preprocessing for improved testing and training outcomes. This holistic approach contributes to a nuanced understanding of cyberbullying propagation, emphasizing the importance of addressing gender-specific trends and negative language prevalent in online interactions.
B. Data Cleaning
Particularly vital for cleaning social media user data, this stage removes various irrelevant elements. Punctuation, special characters, retweet symbols, hashtags, numeric and hexadecimal values, and URLs are removed, as they do not contribute to sentence meaning. Additionally, words with fewer than three letters are eliminated, and all text is converted to lowercase for consistency. This meticulous cleaning process ensures a refined dataset for subsequent classification tasks, enhancing the models' effectiveness in extracting meaningful patterns from social media content [13].
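The sketch below shows one possible implementation of these cleaning rules in Python; the regular-expression patterns are illustrative, not the study's exact implementation.

```python
import re

def clean_comment(text: str) -> str:
    """Apply the cleaning rules described above; patterns are illustrative."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)       # URLs
    text = re.sub(r"\bRT\b", " ", text)                 # retweet symbols
    text = re.sub(r"#\w+", " ", text)                   # hashtags
    text = re.sub(r"0[xX][0-9a-fA-F]+", " ", text)      # hexadecimal values
    text = re.sub(r"\d+", " ", text)                    # numeric values
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # punctuation / special chars
    words = [w for w in text.lower().split() if len(w) >= 3]  # drop short words
    return " ".join(words)

# e.g. "user check this"
print(clean_comment("RT @user: Check this!! https://t.co/abc #toxic 0xFF 123"))
```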
C. Dataset Preprocessing
After the initial data cleaning phase, Natural Language Processing (NLP) techniques are applied to transform raw text into a format suitable for machine learning algorithms. This involves three key processes outlined in Fig. 7. First, tokenization is employed to split each phrase in the tweet into smaller chunks, such as sentences, words, and symbols, known as tokens. Following tokenization, lemmatization is implemented to reduce words to their root forms, enhancing the algorithm's comprehension by standardizing inflectional variations. Subsequently, vectorization is performed to convert the text into numerical vectors of real numbers. This process is crucial for enabling machine learning models to process and understand textual information effectively. After data cleaning and preprocessing (as depicted in Fig. 6 and Fig. 7), the dataset is split into training and testing sets. The testing dataset, crucial for real-time system usage, is extracted from platforms via text mining.
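A minimal sketch of the tokenization and lemmatization steps, assuming NLTK as the toolkit (the paper does not name a specific library); the example sentence is hypothetical.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time resource downloads (punkt_tab is needed on newer NLTK releases).
for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    tokens = word_tokenize(text.lower())                       # tokenization
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]           # lemmatization

# e.g. ['bully', 'sending', 'hateful', 'message', 'repeatedly']
print(preprocess("The bullies were sending hateful messages repeatedly"))
```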
Both datasets undergo preprocessing techniques and are fed into various machine-learning models to facilitate effective classification tasks. This comprehensive approach ensures the model's ability to generalize and make accurate predictions on unseen data.
D. Feature Extraction and Feature Selection
Word embedding is a pivotal technique in machine learning that represents words as vectors in multi-dimensional spaces, crucial for addressing natural language processing (NLP) challenges. The Word2Vec model, featuring a vocabulary of 13,507 non-bullying and 17,259 bullying words and an embedding dimension of 16, is employed for this purpose. This method proves invaluable for NLP tasks, including identifying related words, semantic grouping, and text classification. Word2Vec, functioning as a two-layer neural network, translates human language into machine language, creating word embeddings applicable to tasks such as text similarity and sentiment analysis. Leveraging approaches like Continuous Bag of Words (CBOW) and Skip-Gram, it predicts target words based on context or vice versa. During training, the model learns embeddings by representing a corpus as an N-dimensional vector. To address variance issues common in neural network models like Convolutional Neural Networks (CNNs), an ensemble learning approach is implemented. Multiple models run concurrently with varying hyperparameters, and their outputs are aggregated using a Random Forest classifier and the Max Voting technique [14]. This ensemble method significantly enhances accuracy while mitigating the trade-off between bias and variance, making it a powerful tool.
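As a brief, hedged illustration of the embedding step, the following trains a tiny Word2Vec model with gensim; the toy sentences stand in for the study's tokenized comments, and only the embedding dimension of 16 is taken from the text above.

```python
from gensim.models import Word2Vec

# Toy pre-tokenized corpus standing in for the study's comment dataset.
sentences = [
    ["you", "are", "worthless"],
    ["nobody", "likes", "you"],
    ["have", "a", "great", "day"],
]

# vector_size=16 matches the embedding dimension stated above; sg=0 selects
# CBOW (sg=1 would select Skip-Gram). Other hyperparameters are assumed.
w2v = Word2Vec(sentences, vector_size=16, window=3, min_count=1, sg=0)
vec = w2v.wv["worthless"]        # 16-dimensional embedding for one token
```

And a generic stand-in for the max-voting aggregation, using scikit-learn's VotingClassifier rather than the paper's exact CNN ensemble; the features and labels are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1]]   # placeholder features
y = [-1, 0, -1, 0]                                      # -1 = toxic, 0 = non-toxic

# Hard voting returns the majority (max-vote) class across the base models.
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50)),
        ("lr", LogisticRegression()),
        ("svc", SVC()),
    ],
    voting="hard",
)
voter.fit(X, y)
print(voter.predict([[0.15, 0.85]]))    # likely [-1]
```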
E. Metrics and Evaluation
Measuring the ratio of correctly predicted observations to all observations, this metric is the most basic way to assess performance. Accuracy is especially informative when working with symmetric samples, where false positives and false negatives occur in nearly equal numbers. The accuracy of the model is calculated using the classification results collected during each test phase. Writing $n_c$ for the number of correctly classified samples and $n$ for the total number of samples, it is stated as: Accuracy (%) = $\frac{n_c}{n} \times 100\%$. The following complementary metrics are also used; a brief computation of all of them appears after the list.
• Precision: The positive predicted value is another name for precision. It is the percentage of truly positive predicted positives:
$$\text{Precision}=\frac{TP}{TP+FP}$$
• Recall: The percentage of real positives that are anticipated to be positive is known as recall:
$$\text{Recall}=\frac{TP}{TP+FN}$$
• F1 Score: The F1 score serves as a comprehensive metric, combining precision and recall, offering a holistic measure of a categorization system's accuracy. It is calculated as the harmonic mean of precision and recall. For binary and multiclass classification, F1 scores, ranging from 0 to 1, are commonly employed to assess predictive performance. The ROC curve visually illustrates the true positive rate (TPR) against the false positive rate (FPR). In this study, the F1 score, together with the ROC curve, was utilized to determine the most effective classification model:
$$\text{F1 Score}=\frac{2\times P\times R}{P+R}$$
• Accuracy: Accuracy is the proportion of correctly classified instances (true positives and true negatives) among all instances:
$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
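The sketch below computes these metrics with scikit-learn on hypothetical label arrays; the values are placeholders, not results from the study.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [-1, 0, -1, -1, 0, 0, -1, 0]   # ground-truth labels (-1 = toxic)
y_pred = [-1, 0, 0, -1, 0, -1, -1, 0]   # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))                  # 0.75
print("Precision:", precision_score(y_true, y_pred, pos_label=-1))   # 0.75
print("Recall   :", recall_score(y_true, y_pred, pos_label=-1))      # 0.75
print("F1 score :", f1_score(y_true, y_pred, pos_label=-1))          # 0.75
print(confusion_matrix(y_true, y_pred))
```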
F. Applying Machine Learning Algorithm
Following these preprocessing and feature selection stages, a variety of machine learning models were examined, leading to the identification of seven models to be functionally compared. These models were selected based on findings reported by multiple authors as well as factors like popularity, usability, and back-end functionality. The classifiers employed in the study are listed below; a brief comparison sketch follows the list:
• Logistic Regression: Logistic regression is a classification model employing a logistic function to represent a binary outcome. In mathematical terms, it utilizes the logistic function to compress the result of a linear equation, constraining it to a range between 0 and 1: $P(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x)}}$.
• Decision Tree Classifier: Decision Trees serve dual purposes in addressing classification and regression challenges. Conceptually, they form a tree-shaped structure, employing tuned parameters for predictions. Employing a top-down approach during training, decision trees effectively analyze datasets, making them versatile tools in both classification and regression scenarios.
• Random Forest Classifier: Leverages the collective decision-making of numerous decision trees, with the majority vote determining the model's prediction. This approach ensures robustness, scalability, and resistance to overfitting. While fast and easy to interpret, its real-time prediction capability may diminish with a higher number of trees.
• Multinomial Naive Bayes (NB): A probabilistic algorithm widely applied in Natural Language Processing (NLP). Leveraging Bayes' theorem, it predicts text tags, such as those for newspaper articles, by calculating and returning the tag with the highest probability, assuming features are independent.
• KNeighbors Classifier: K-Nearest Neighbors (KNN) is a straightforward text classification algorithm that determines the class of new data based on similarity measures with existing data. It uses distance metrics to identify the K nearest neighbors and assigns the most frequent class among them to the new sample [15].
• Support Vector Machines (SVM): Powerful for text classification due to their ability to find a hyperplane in n-dimensional space, effectively classifying data points. Linear SVMs are often applied to text classification problems with many features. The decision boundary is defined by the equation $f(x)=w^{T}x+b$, where $w$ is the weight vector, $x$ is the input feature vector, and $b$ is the bias term.
• Stochastic Gradient Descent (SGD) Classifier: An optimization algorithm employed for minimizing cost functions, commonly used to train linear classifiers such as SVM and Logistic Regression. It facilitates discriminative learning and optimization, particularly effective for large-scale linear models [16].
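To make the comparison concrete, here is a minimal sketch that trains all seven classifiers on TF-IDF features with scikit-learn; the toy comments, labels, and hyperparameters are illustrative placeholders rather than the study's actual data or settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the 8,455-comment dataset; -1 = toxic, 0 = non-toxic.
comments = [
    "you are a pathetic loser", "nobody wants you here", "go away you idiot",
    "everyone hates you", "you are so stupid",
    "great game last night", "congratulations on the award",
    "what a lovely photo", "thanks for sharing this", "have a wonderful day",
]
labels = [-1, -1, -1, -1, -1, 0, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(comments)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Multinomial NB": MultinomialNB(),
    "KNeighbors (k=3)": KNeighborsClassifier(n_neighbors=3),
    "Linear SVM": LinearSVC(),
    "SGD Classifier": SGDClassifier(random_state=0),
}

# Cross-validated accuracy; cv=2 only because the toy dataset is tiny.
for name, model in models.items():
    scores = cross_val_score(model, X, labels, cv=2, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```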