The primary aim of this section is to conduct a comparative analysis of the performance of CNN-LSTMs, SVMs, and the Pysentimento framework utilizing BERT for sentiment analysis within the domain of cryptocurrencies. Based on accuracy, the best model is selected for sentiment analysis. Next, Google Trends data, trading volume, and the results of sentiment analysis (positive, negative, and neutral) are combined in a table and normalized after obtaining the overall sentiment sum.
After that, the Pearson correlation coefficient is applied, and time series prediction implementing the SARIMA model is conducted to unveil which cryptocurrency is the safest for investment during times of geopolitical tension.
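For concreteness, the following minimal sketch illustrates this downstream pipeline in Python with pandas and statsmodels; the file name, column names, sentiment-sum formula, normalization, and SARIMA orders are illustrative assumptions rather than the study's exact configuration.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Illustrative daily frame; file name and column names are assumptions.
df = pd.read_csv("btc_indicators.csv", parse_dates=["date"], index_col="date")

# Overall sentiment sum from the sentiment counts (assumed formula),
# followed by min-max normalization of every indicator.
df["sentiment_sum"] = df["positive"] - df["negative"]
norm = (df - df.min()) / (df.max() - df.min())

# Pearson correlation coefficients between the normalized indicators.
print(norm[["sentiment_sum", "google_trends", "volume", "price"]].corr(method="pearson"))

# SARIMA forecast of the price series; the (p,d,q)(P,D,Q,s) orders are placeholders.
fit = SARIMAX(norm["price"], order=(1, 1, 1), seasonal_order=(1, 1, 1, 7)).fit(disp=False)
print(fit.forecast(steps=30))
```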
6.1 Data preprocessing and training of (CNN-LSTM)
The training and testing dataset comprises 50,859 tweets on Bitcoin that have been categorized as ['positive'], ['negative'], and ['neutral']. This dataset has been obtained from the Kaggle website, a renowned platform and online community for data scientists and machine learning practitioners (Dundee, 2002). Eighty percent of the data is devoted to training and the remaining twenty percent to testing: the training set consists of 40,687 tweets, while the testing set comprises 10,172 tweets. A larger training set allows the model to learn from a substantial amount of data, capturing patterns and relationships in the tweets that are essential for making accurate predictions. The bar plot indicates that 22,937 tweets are labeled as positive, 21,939 as neutral, and 5,983 as negative. This exploratory analysis offers an initial understanding of sentiment distribution prior to model development. (Fig. 1) conveys such a distribution.
Preprocessing begins by transforming the "tweet" column into a "clean_tweet" column, while the "label" column holds the sentiments. The word count is a common feature employed in text analysis tasks to assess the complexity and content of text data, whereas the text length column quantifies the length of each tweet in characters; this metric is valuable for examining the distribution of tweet lengths within the dataset. The mean and standard deviation enable the researcher to determine an appropriate padded tweet length employing the Freedman-Diaconis rule, which yields 30.
The training and testing shapes become (40687, 30) and (10172, 30), respectively. (Fig. 2) shows the data after preprocessing with both metrics, while (Fig. 3) uncovers the word cloud, a powerful visual representation of frequently occurring words in text data, generated at this stage: it offers a visual summary highlighting the most prevalent words within the dataset and provides insights into the dataset's underlying themes and characteristics. The model is trained for 5 epochs with a batch size of 128.
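A minimal sketch of this preprocessing step, assuming the Keras Tokenizer and pad_sequences utilities; the names train_df, test_df, and label_id, as well as the vocabulary cap, are illustrative assumptions not reported in the study.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

MAX_LEN = 30          # padded length derived via the Freedman-Diaconis rule
MAX_FEATURES = 20000  # assumed vocabulary cap; the study does not report its value

tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train_df["clean_tweet"])

# Integer-encode each tweet and pad/truncate to 30 tokens, giving the
# reported shapes (40687, 30) for training and (10172, 30) for testing.
X_train = pad_sequences(tokenizer.texts_to_sequences(train_df["clean_tweet"]), maxlen=MAX_LEN)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_df["clean_tweet"]), maxlen=MAX_LEN)

# One-hot encode the labels, assuming they are already mapped to 0/1/2.
y_train = to_categorical(train_df["label_id"], num_classes=3)
y_test = to_categorical(test_df["label_id"], num_classes=3)
```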
The model in (Fig. 4) undergoes five processing stages. To begin with, an embedding layer maps each of the 30 input tokens to a 200-dimensional vector, forming a tensor of shape (30, 200) for each input sequence. It involves two crucial parameters: maximum features and embedding dimensions. These parameters grant the model flexibility in encoding word information, making it adaptable to the dataset's specific characteristics. Subsequently, the architecture incorporates two convolutional 1D layers, which introduce a spatial understanding of the text data.
These layers employ a set of 1D convolutional filters, akin to sliding windows, to detect local patterns and feature representations within the tweet sequences. By utilizing filters with varying receptive field sizes, the model becomes proficient at recognizing both fine-grained details and broader textual features. The Rectified Linear Unit (ReLU) activation function adds a critical element of non-linearity, enabling the model to capture complex relationships between words. Following the convolutional layers, two MaxPooling 1D layers act as a dimensionality reduction mechanism. These layers systematically downsample the output from the convolutional layers, preserving the most salient and informative features while mitigating computational complexity. This process allows the model to concentrate on the most relevant elements of the data, enhancing its efficiency and focus.
The neural network architecture further evolves with the inclusion of an LSTM layer, which excels at capturing sequential dependencies and long-range contextual information within text data. This layer is vital in modeling the temporal dynamics of tweet sequences, understanding how words relate to each other over time, and capturing intricate patterns that might span the entire sequence. Finally, a dense layer maps the LSTM output to a three-dimensional vector of shape (3), where the softmax activation transforms the model's internal representations into probability distributions across the three sentiment classes: negative, neutral, and positive (Gaber et al., 2021).
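The architecture described above can be sketched as follows in Keras; the filter counts, kernel sizes, and LSTM units are assumptions, as the study does not report them, and MAX_FEATURES, X_train, and y_train continue the preprocessing sketch above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

model = Sequential([
    # Map each of the 30 input tokens to a 200-dimensional vector.
    Embedding(input_dim=MAX_FEATURES, output_dim=200),
    # Two 1D convolutions act as sliding windows over the token sequence;
    # filter counts and window sizes are assumptions, not reported values.
    Conv1D(filters=64, kernel_size=3, activation="relu"),
    MaxPooling1D(pool_size=2),
    Conv1D(filters=64, kernel_size=5, activation="relu"),
    MaxPooling1D(pool_size=2),
    # The LSTM captures sequential dependencies across the pooled features.
    LSTM(100),
    # Softmax over the three sentiment classes: negative, neutral, positive.
    Dense(3, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=128, validation_data=(X_test, y_test))
```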
6.2 Classification report of (CNN-LSTM)
Based on Sharma and Sharma (2022), classification reports are essential instruments for assessing the efficacy of models intended to infer attitudes (positive, negative, or neutral) from textual data. These reports provide crucial metrics, each of which offers unique information on how well the model works. The first metric, precision, measures how well positive predictions hold up against false positives, i.e., how frequently the model's predicted positives are actually positive. The second, recall, gauges how well the model recognizes the actual positive examples, demonstrating its sensitivity to genuinely positive instances. Finally, the third metric, the F1-score, strikes a middle ground between precision and recall; this middle ground is especially helpful when the distribution of sentiment classes is not uniform. (Fig. 5) uncovers the CNN-LSTM classification report.
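Such reports can be generated with scikit-learn; a minimal sketch, with illustrative gold labels and predictions standing in for the model's outputs:

```python
from sklearn.metrics import classification_report

# Illustrative gold labels and predictions; 0 = neg, 1 = neu, 2 = pos.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]
print(classification_report(y_true, y_pred, target_names=["neg", "neu", "pos"]))
```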
The results demonstrate the ability of the proposed method to classify tweets into three sentiment categories: positive (pos), negative (neg), and neutral (neu). For the negative class, the precision is approximately 0.92, indicating that when the model predicts a tweet as negative, it is correct approximately 92% of the time. The recall, at around 0.95, demonstrates that the model captures about 95% of the actual negative tweets. The F1-score, which harmonizes these metrics, is approximately 0.93, suggesting a strong balance between precision and recall for the negative sentiment class. This means that the model excels in identifying and correctly classifying tweets expressing negative sentiments.
Regarding the neutral sentiment class, the precision is roughly 0.98, indicating that the model's predictions of neutrality are highly accurate. The recall, at approximately 0.97, signifies that the model correctly identifies about 97% of the actual neutral tweets. The F1-score of about 0.98 reaffirms the model's exceptional performance in classifying neutral sentiment, with a balanced combination of precision and recall. This signifies the model's proficiency in distinguishing neutral tweets from others.
The positive sentiment class exhibits similar excellence, with a precision of approximately 0.98, indicating highly accurate positive predictions. The recall, around 0.98, indicates that the model captures about 98% of the actual positive tweets. The F1-score, approximately 0.98, reflects the strong balance between precision and recall for the positive sentiment class. This underscores the model's ability to effectively identify and classify positive sentiment in tweets. The overall model performance is impressive, with an accuracy of approximately 97%. This accuracy demonstrates the model's proficiency in classifying tweets across all sentiment categories.
The macro-average F1-score, at about 0.96, signifies that the model maintains a robust balance between precision and recall for all sentiment classes, considering their individual support levels. Additionally, the weighted average F1-score, also around 0.97, indicates the model's consistency in performance across different sentiment classes, considering their varying proportions in the dataset.
To conclude, the model demonstrates high proficiency in classifying tweets into positive, negative, and neutral sentiment categories, with an overall accuracy of 97%. It upholds robust balanced performance across sentiment classes, as indicated by macro and weighted average F1-scores of 0.96 and 0.97 respectively. The results emphasize the model's effectiveness in sentiment analysis of social media texts.
6.3 Data preprocessing and training of (SVM)
All the steps of this phase are similar to those in the previous model apart from minor differences. First of all, tokenization breaks the text into individual words or tokens, enhancing the model's ability to comprehend and analyze the content. Secondly, stop-words removal is essential for eliminating common but uninformative words. Thirdly, lemmatization reduces words to their base or root forms, ensuring consistency in the data. Variations like "running" and "ran" are both transformed to "run," streamlining feature extraction and pattern recognition. Finally, Part-of-Speech Tagging (POS) enriches the data by assigning grammatical labels to each word, allowing for the capture of specific linguistic patterns. This step can aid in discerning verbs, adjectives, or other parts of speech relevant to sentiment analysis.
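A minimal sketch of this pipeline with NLTK follows; appending the POS tag to each token is one plausible reading of the final step, not necessarily the study's exact implementation.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(tweet: str) -> list[str]:
    # Tokenize, drop stop-words, lemmatize, and attach a POS tag to each token.
    tokens = [t.lower() for t in word_tokenize(tweet) if t.isalpha()]
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return [f"{tok}_{tag}" for tok, tag in pos_tag(tokens)]

print(preprocess("Bitcoin is running higher while traders ran for cover"))
```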
For this model, the 200-dimensional GloVe embeddings are imported from Hugging Face; they have been widely implemented in NLP tasks. The 200 dimensions of the word vectors mean that each word is represented by a vector of 200 numerical values, providing a rich and informative representation of words. The feature works by converting words into a continuous vector space, where the similarity between words can be measured using cosine similarity (Pennington et al., 2014).
On the positive side, this feature is built upon a vast and diverse corpus of tweets, enabling it to effectively capture informal, colloquial language, slang, and even emoticons commonly found in social media conversations and tweets. It excels at capturing both global and local information from the corpus, including word frequency, word order, and word context. Additionally, the model can unveil intriguing linear relationships between words, such as analogies, antonyms, and synonyms. However, there are certain restrictions associated with the Hugging Face GloVe embeddings. Firstly, its performance is contingent upon the vocabulary size and coverage of the underlying corpus, which may not encompass rare or domain-specific (cryptocurrency) words, potentially limiting its applicability in specialized contexts. Secondly, the feature may struggle to capture intricate and nonlinear relationships between words, including polysemy, homonymy, irony, and sarcasm. Lastly, it may encounter difficulties in handling out-of-vocabulary words or common typographical errors commonly encountered in the informal language of tweets. Understanding these merits and demerits is crucial when considering the application of the Hugging Face GloVe embeddings 200d in various natural language processing tasks (Pennington et al., 2014).
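One common way to turn these embeddings into tweet-level features is to average the vectors of a tweet's tokens; the sketch below assumes the Twitter-trained 200-dimensional GloVe vectors available through gensim's downloader, which may differ from the exact Hugging Face artifact used in the study.

```python
import numpy as np
import gensim.downloader as api

# 200-dimensional GloVe vectors trained on a large Twitter corpus.
glove = api.load("glove-twitter-200")

def tweet_vector(tokens: list[str]) -> np.ndarray:
    # Average the vectors of in-vocabulary tokens; out-of-vocabulary words
    # (a limitation noted above) are simply skipped.
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

# train_df with a "clean_tweet" column is assumed from the preprocessing step.
X_train_glove = np.vstack([tweet_vector(t.split()) for t in train_df["clean_tweet"]])
```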
Conversely, TF-IDF is a fundamental technique in NLP. Being part and parcel of this model, TF-IDF addresses a critical challenge in NLP: how to represent the inherent information and nuances of text data in a way that algorithms can effectively process (Sharma et al., 2023). TF-IDF vectorization offers several advantages compared to alternative text representation methods like bag-of-words or word embeddings.
First of all, it addresses the issue of high dimensionality by considering only the words that actually appear in the documents of the dataset, which keeps the vocabulary manageable and makes the representation computationally efficient. Secondly, TF-IDF captures both local and global information about words: it accounts for word frequency within a document and across all documents in the dataset, providing a holistic view of word importance. Thirdly, TF-IDF assigns higher weights to words that carry more informative or distinctive characteristics for a document or topic, while assigning lower weights to common or generic words. This property is particularly valuable for highlighting the significance of words in context. Finally, TF-IDF is known for its simplicity in implementation and interpretation, as it does not necessitate complex mathematical operations or external resources (Mikolov et al., 2013).
Based on Bird et al. (2009), there exist certain limitations and challenges associated with TF-IDF vectorization. To begin with, TF-IDF assumes that words are independent of each other, disregarding their order and context within a document, which can lead to the loss of valuable sequential information. Moreover, TF-IDF does not capture semantic or syntactic relationships between words, such as synonyms, antonyms, or grammatical structures, which limits its ability to understand the deeper meaning of language. Additionally, TF-IDF may assign low weights to words that are relevant but infrequent across all documents, such as proper nouns or domain-specific terms. Finally, TF-IDF can be sensitive to outliers or noisy data, including spelling errors or typos. To mitigate some of these limitations, the researcher employs complementary techniques like stemming, lemmatization, NLTK stop-word removal, and adding POS tags to tokens alongside TF-IDF.
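A minimal sketch of the TF-IDF vectorization with scikit-learn, assuming train_df and test_df hold the preprocessed tweets; the 9,991-feature vocabulary reported below emerges from the corpus rather than from an explicit setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on the preprocessed training tweets only, then transform both splits;
# in the study the resulting matrices are (40687, 9991) and (10172, 9991).
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_df["clean_tweet"])
X_test_tfidf = vectorizer.transform(test_df["clean_tweet"])
```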
These complementary enhancements improve TF-IDF's overall performance and accuracy, making it a versatile choice for text analysis tasks; they were already put into practice in the first preprocessing step. (Fig. 6) shows the tweets after the preprocessing steps have been executed. The input data shapes are then examined: the training GloVe matrix holds 40,687 samples by 200 features and the testing GloVe matrix 10,172 by 200, encoding tweets in a semantic space, while the training and testing TF-IDF matrices have the same sample counts and 9,991 features, representing tweets in a high-dimensional space. GridSearchCV is a tool that performs an exhaustive search over specified parameter values for an estimator using cross-validation. Cross-validation splits the data into k folds, utilizes one fold as the test set and the rest as the training set, and repeats this process k times, averaging the results. This approach provides a more reliable estimate of the estimator's performance on unseen data.
Three-fold cross-validation is employed to evaluate each combination of parameter values. This conveys that data will be split into three parts, and each part is used as a test set once, while the other two parts are employed as a training set. The average score across the three folds is used as the performance metric for each combination. The parameters that are tuned are as follows: C, which is the regularization parameter for SVM that controls the trade-off between margin maximization and error minimization; and kernel, which is the kernel function for SVM that determines the type of transformation applied to the data.
After that, the researcher designates 'all' as the value for k, indicating the selection of all features by SelectKBest. The 'SVM C' parameter, which represents the regularization strength, is set to a range of values including 0.1, 1, and 10 in the SVM classifier. Furthermore, two kernel options, "linear" and "RBF" (Radial Basis Function), are defined for the 'SVM kernel' parameter, which affects how the SVM maps the input data. The first finds a linear hyperplane that separates the data points based on their features (positive, negative, neutral). The second maps the data points into a higher-dimensional space where a linear hyperplane can separate them better than in the original space. The grid search process systematically evaluates the model's performance across different sets of these hyperparameter values in order to determine which set best improves model performance, as sketched below. The results are illustrated in (Fig. 7).
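A minimal sketch of this grid search with scikit-learn; the scoring metric and the score function for SelectKBest are assumptions, as the study does not report them.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("svm", SVC()),
])

param_grid = {
    "select__k": ["all"],              # keep every feature
    "svm__C": [0.1, 1, 10],            # regularization strength
    "svm__kernel": ["linear", "rbf"],  # linear vs. RBF mapping
}

# Three-fold cross-validation over every parameter combination.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train_tfidf, train_df["label"])
print(search.best_params_, search.best_score_)
```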
The figure conveys that utilizing "all" features, setting C to 10, and employing the RBF kernel provide the best performance for SVM with both GloVe and TF-IDF features on the data. C is a hyperparameter of the SVM classifier that controls the trade-off between margin maximization and error minimization. A higher value of C means that the classifier tries to fit the data more closely, but it may also overfit and generalize poorly. A lower value of C means that the classifier allows more errors, but it may also underfit and miss important patterns. C = 10 is therefore a reasonably high value for this SVM classifier with GloVe and TF-IDF features. The RBF kernel is frequently implemented for capturing non-linear relationships in the data, which is indicative of the data's complexity and non-linearity. The model achieves accuracy scores of approximately 91.94% and 88.22% with GloVe and TF-IDF, respectively. Such scores indicate how well the model can classify tweets into their respective sentiment categories; higher scores normally indicate better model performance.
6.4 Classification reports of (SVM)
6.4.1 Classification report of (SVM-GloVe)
As revealed in (Fig. 8), with regard to sentiment analysis, the classification report provides a thorough summary of the performance data obtained from the established Support Vector Machine (SVM) model that makes use of GloVe features. This assessment methodology includes precision, recall, F1-score, and support measures that are specific to the three sentiment categories (positive, neutral, and negative). Central measures of the model's prediction accuracy for every sentiment class are the precision values, which are shown as noteworthy percentages: 91.83% for the negative sentiment category, 94.46% for the neutral sentiment category, and 91.25% for the positive sentiment category.
The recall percentages surface as 95.17% for negative sentiments, 93.54% for neutral sentiments, and 82.39% for positive sentiments. The accuracy, a pivotal metric gauging correctness in classifying cases across all sentiment classes, is 92.96%. This overarching accuracy metric provides a holistic viewpoint on the model's effectiveness in sentiment analysis. Macro and weighted averages, integral portions of the report, offer nuanced appraisals accounting for the balance and distribution of sentiment categories. The macro-averaged precision, recall, and F1-score are 92.51%, 90.37%, and 91.35%, respectively, while the weighted-averaged precision, recall, and F1-score are 92.97%, 92.96%, and 92.92%, respectively. These averages offer a more comprehensive assessment of the impact of class imbalances on the model's performance.
To summarize, the classification report delivers comprehensive and illuminating information about the SVM model's performance, providing nuanced insights into its precision, recall, and F1-score for the various sentiment categories. While macro and weighted averages offer a comprehensive picture of the model's performance in sentiment analysis tasks, the support metric and overall accuracy add further context. This thorough examination is essential for identifying the model's advantages and possible areas for improvement, enabling well-informed sentiment analysis decision-making.
6.4.2 Classification report of (SVM-TF-IDF)
The deployed (SVM) model's sentiment analysis performance is painstakingly evaluated in the classification report in (Fig. 9). The assessment is carried out via TF-IDF characteristics, a commonly employed method for determining a word's relevance inside a collection of documents. The report provides a detailed overview of the discriminative skills of the model by covering precision, recall, F1-score, and support metrics for each sentiment category (positive, neutral, and negative).
The precision values signify the degree of accuracy with which the model predicts each sentiment class. In the present case, the precision of the negative sentiment class is 86.54%, indicating that the model is capable of correctly classifying negative attitudes. The model's accuracy in predicting neutral attitudes is demonstrated by the neutral sentiment class's precision of 94.47%, whereas the positive sentiment class's precision is 90.70%.
Metrics for recall offer valuable information on how well the model captures examples from each sentiment category. Recall for the negative class is 95.63%, highlighting the model's capacity to accurately identify a significant percentage of real negative cases. With an 88.49% recall rate, the neutral class demonstrates a strong sensitivity to real-world neutral mood occurrences. Nevertheless, the recall of the positive class is 78.14%, indicating a somewhat lower capture of real positive cases.
F1-scores, the harmonic mean of precision and recall, shed more light on the model's balance. With an F1-score of 90.85%, the negative sentiment class exhibits a well-balanced trade-off between recall and precision. Likewise, the class representing neutral sentiments has an F1-score of 91.38%, signifying a well-balanced approach to forecasting neutral sentiments. On the other hand, the F1-score of 83.95% for the positive sentiment class indicates a less favorable trade-off between recall and precision for positive sentiments.
The SVM model with TF-IDF features has an overall accuracy of 90.35%, which provides a global indicator of how accurately it categorizes instances across all sentiment categories. Macro statistics and weighted averages offer a thorough analysis that takes the impact of class disparities into account. The macro-averaged precision, recall, and F1-score are 90.57%, 87.42%, and 88.73%, respectively, while the weighted-averaged precision, recall, and F1-score are 90.65%, 90.70%, and 90.35%, respectively.
To conclude, this thorough assessment provides a detailed overview of the SVM model's advantages and disadvantages in sentiment analysis tasks, offering insightful information that can be used to develop the model and make well-informed decisions.
The comparative study of the SVM model employing both TF-IDF and GloVe features has several implications. The TF-IDF variant is consistently outperformed by the SVM model combined with GloVe embeddings in terms of accuracy, precision, and F1-score across the sentiment classes. This consistency implies that GloVe embeddings play a major role in building a more sophisticated and reliable sentiment analysis model due to their capacity to capture complex semantic information.
Both feature sets show a high level of competence when it comes to neutral sentiments, demonstrating their ability to navigate the wide range of expressions in this category. The F1-scores accurately reflect the subtle nature of neutral sentiment expressions, which forces models to strike a careful balance between recall and precision. With the highest recall in this category, the SVM model utilizing TF-IDF features demonstrates a distinct ability to capture occurrences of negative sentiment efficiently. This finding suggests that TF-IDF could be highly advantageous in recognizing manifestations of negative emotion, providing insightful information, especially in applications where detecting negative sentiments is critical. These findings highlight the significance of contemplating the intricacies of sentiment expressions and carefully evaluating the trade-offs between precision and recall when selecting a sentiment analysis model.
6.5 Pysentimento
This particular model necessitates neither a preprocessing nor a training phase, as it is a pre-trained model equipped with a tokenizer. The process commences with the installation of the Pysentimento library (version 0.7.2) and its prerequisites. It procures the following dependencies: the "accelerate" library (version 0.22.0), "datasets" (version 2.14.5), and "emoji" (version 1.7.0). The model architecture, "robertuito," indicates that this model is based on the RoBERTa architecture. RoBERTa is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model and is known for its effectiveness in a wide range of NLP tasks. The model is fine-tuned specifically for sentiment analysis. Fine-tuning involves training a pre-existing model on a task-specific dataset; in this case, the model has been trained on a dataset of text samples with associated sentiment labels (positive, negative, and neutral). This fine-tuning process supports the model in learning the patterns and features relevant to sentiment analysis. Pre-trained models for sentiment analysis have become increasingly popular due to their effectiveness in capturing nuances in sentiment across various domains (Pérez et al., 2021).
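A minimal usage sketch, assuming the library's create_analyzer API (the package is published on PyPI as pysentimiento); the example tweet is illustrative.

```python
from pysentimiento import create_analyzer

# Load the pre-trained sentiment analyzer; no training phase is needed.
analyzer = create_analyzer(task="sentiment", lang="en")

result = analyzer.predict("Bitcoin is pumping again, feeling great about my bag!")
print(result.output)  # one of POS / NEG / NEU
print(result.probas)  # probability for each of the three classes
```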
The "accelerate" library, designed for Python, focuses on optimizing and expediting computations, particularly within the realms of numerical and scientific computing. Moreover, the "emoji" library, also a Python library, augments the project with capabilities centered on "emoji" processing and management. This library enables tasks such as "emoji" detection, extraction, and manipulation within textual data. Its functionality extends to "emoji" identification, conversion between "emoji" and Unicode representations, and "emoji" visualization. Incorporating the "emoji" library is a sensible decision aimed at enhancing the project's capacity to adeptly manage emojis within the sphere of NLP endeavors, encompassing applications such as sentiment analysis, text classification, and text generation (carpedm20, 2015). The dataset dedicated to training purposes consists of 50,859 tweets related to Bitcoin. It is partitioned into an 80% training set and a 20% testing set previously. The testing dataset from the previous models, encompassing 10,174 tweets, is stored in a CSV file. This dataset will be employed to assess the accuracy of the Pysentimento model, ensuring uniformity across all three models under evaluation. (Fig. 10) and (Fig. 11) demonstrate the stages of Pysentimento model and the results of Pysentimento application on a portion of the testing dataset correspondingly. In (Fig. 8), the "label" column represents the classification of the training dataset whereas the "sentiment" is the one for Pysentimento.
6.5.1 Classification report of (Pysentimento)
The presented classification report concerns the "Pysentimento" model's performance following its application to the test data. This report evaluates the model's capabilities in classifying text-based sentiments into three categories: "positive," "negative," and "neutral." It is crucial to analyze each aspect of the report to gain a comprehensive understanding of the model's performance, as conveyed in (Fig. 12).
When it comes to assessing the "positive" emotion category, the model performs admirably. For this category, the precision is 0.89, meaning that 89% of the feelings that the model predicted to be "positive" are indeed true. The recall for the "positive" category is equally impressive, with a score of 0.88, indicating that the model effectively captures 88% of actual positive sentiments present in the dataset. An F1-score of 0.88 in this situation indicates that the model is remarkably well-balanced, with a strong equilibrium between recall and precision. This equilibrium suggests that the model performs well in predicting "positive" feelings and in capturing the majority of genuine positive sentiments seen in the data.
The model's precision score in the "negative" sentiment category is 0.83, which means that 83% of the feelings it identified as "negative" are actually negative. However, the recall for the "negative" category is 0.69, demonstrating that the model accurately classifies only 69% of the real negative attitudes in the dataset. For the "negative" category, the F1-score, which balances precision and recall, is recorded as 0.76, indicating reasonably harmonious performance in this category. With a precision of 0.92 in the "neutral" sentiment category, the "Pysentimento" model does exceptionally well, correctly predicting 92% of the "neutral" feelings among its predictions.
Moreover, the recall for the "neutral" category is an impressive 0.97, highlighting the model's capacity to correctly classify 97% of actual neutral sentiments in the dataset. The F1-score for “neutral” is 0.94, denoting a balance between precision and recall. This score highlights the model’s ability to discern neutral sentiments. The “Pysentimento” model has an overall accuracy of 90%, demonstrating its competence in sentiment prediction across all categories.
In conclusion, the "Pysentimento" model performs admirably when tested utilizing test data. It maintains a balanced F1-score, demonstrating its efficacy in capturing various sentiment categories, and demonstrates notable strengths in precision, especially in the "neutral" category. The model's overall 90% accuracy rating specifies how consistently it can predict "positive," "negative," and "neutral" sentiments.