Dataset Description
The dataset for this investigation is sourced from publicly accessible databases and Application Programming Interfaces (APIs) that integrate directly with Python.
a. Bitcoin Price Data
The Bitcoin market data was obtained through the Historic-Crypto library, an open-source tool that interfaces with the Coinbase Pro API. This dataset encompasses hourly data points for Bitcoin, including the Open, High, Low, Close, and Volume metrics, spanning from February 5th, 2021, to June 20th, 2021.
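As a minimal sketch, the pull could look as follows, assuming the Historic_Crypto package's HistoricalData interface (ticker, granularity in seconds, and start/end dates in its 'YYYY-MM-DD-HH-MM' format):

```python
from Historic_Crypto import HistoricalData

# Hourly (3600-second) BTC-USD candles from the Coinbase Pro API for the study window.
btc = HistoricalData('BTC-USD', 3600,
                     '2021-02-05-00-00',
                     '2021-06-20-00-00').retrieve_data()

print(btc.head())  # columns include open, high, low, close, volume
```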
b. Twitter Data
The Twitter dataset was compiled from Kaggle.com, utilizing the opendatasets library for initial access, and subsequently employing the Tweepy Python library to gather tweets tagged with #bitcoin and #BTC. The dataset features include user-related information (username, location, description, account creation date, follower and friend counts, favorites) and tweet-specific details (timestamp, hashtags, platform used, retweet flag). For the purpose of this study, only the date and textual content of tweets within the specified period (February 5th, 2021, to June 20th, 2021) were analyzed and modeled. The tweets were recorded at 1-minute intervals.
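For reference, the Kaggle download step could look like the sketch below; the dataset URL, file name, and column names are placeholders rather than the exact identifiers used in the study:

```python
import opendatasets as od
import pandas as pd

# Placeholder Kaggle URL -- substitute the actual #bitcoin/#BTC tweet dataset.
od.download('https://www.kaggle.com/datasets/<owner>/<bitcoin-tweets>')

# Keep only the fields analysed in this study: tweet timestamp and text.
tweets = pd.read_csv('<bitcoin-tweets>/tweets.csv',
                     usecols=['date', 'text'],
                     parse_dates=['date'])
```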
c. Google Search Trends
Data on Google search trends related to Bitcoin and BTC were extracted using the pytrends API, a Python interface. This involved querying the Historical Hourly Interest feature for the aforementioned keywords, covering the same period as the Bitcoin and Twitter datasets (February 5th, 2021, to June 20th, 2021).
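A sketch of that query is shown below, assuming a pytrends version that still exposes get_historical_interest (the method has been deprecated in recent releases):

```python
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=0)

# Hourly search interest for 'Bitcoin' and 'BTC' over the study window.
trends = pytrends.get_historical_interest(
    ['Bitcoin', 'BTC'],
    year_start=2021, month_start=2, day_start=5, hour_start=0,
    year_end=2021, month_end=6, day_end=20, hour_end=0,
    sleep=60,  # pause between chunked requests to limit rate-limiting errors
)
```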
Data preprocessing and transformation
The framework operates through an iterative process comprising eight key phases: Problem Definition, Data Acquisition, Preprocessing and Transformation of Data, Exploratory Data Analysis, Correlation Analysis, Feature Selection, Model Development, and Model Optimization (Fig. 1). In the data preprocessing and transformation phase, the initial step involved cleansing the tweet text data to ensure a clean dataset. This fundamental cleansing removed irrelevant characters such as URLs, hashtags (e.g., #BTC, #bitcoin), user mentions, and punctuation, employing regular expressions for efficient automation [8]. Subsequently, further text normalization was performed with the nltk library, including the removal of stopwords (common but uninformative words such as 'the', 'is', and 'a') that could otherwise obscure the analytical clarity of the text data [8]. Additionally, this phase included word tokenization, breaking down sentences into individual words or tokens, and lemmatization, a process of reducing words to their base or dictionary form to ensure consistency in the text data [9].
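A minimal sketch of such a cleaning pipeline is given below; the function name clean_tweet and the exact regular expressions are illustrative rather than the study's precise implementation:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

for pkg in ('stopwords', 'punkt', 'wordnet'):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    """Strip URLs, mentions, hashtags and punctuation, then tokenize,
    remove stopwords and lemmatize the remaining tokens."""
    text = re.sub(r'http\S+|www\.\S+', '', text)      # URLs
    text = re.sub(r'[@#]\w+', '', text)               # mentions, hashtags (#BTC, #bitcoin)
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()   # punctuation and digits
    tokens = word_tokenize(text)
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP_WORDS]
    return ' '.join(tokens)
```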
Feature engineering played a pivotal role in enhancing the dataset by extracting predictive features from the raw data [10]. This included the computation of tweet volume, which aggregated the number of tweets per hour, providing a quantitative measure of Twitter activity related to Bitcoin. Sentiment analysis of the tweets was performed to gauge public opinion towards Bitcoin, categorizing sentiments into positive, negative, and neutral. This was achieved using the TextBlob library for basic polarity and subjectivity analysis, and the VADER scoring system for a more nuanced sentiment analysis that can interpret slang and emoticons commonly found in social media texts [11, 12].
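The sketch below illustrates how both scorers might be combined per tweet; the ±0.05 compound-score cut-offs are the conventional VADER thresholds, assumed here rather than taken from the study:

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

def score_sentiment(text: str) -> dict:
    """Return TextBlob polarity/subjectivity, the VADER compound score,
    and a positive/neutral/negative label derived from the compound score."""
    blob = TextBlob(text)
    compound = vader.polarity_scores(text)['compound']
    label = ('positive' if compound >= 0.05
             else 'negative' if compound <= -0.05
             else 'neutral')
    return {
        'polarity': blob.sentiment.polarity,
        'subjectivity': blob.sentiment.subjectivity,
        'compound': compound,
        'label': label,
    }
```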
To align the various data sources for analysis, data merging was undertaken, standardizing the sentiment data to hourly intervals to match the granularity of the Bitcoin price and Google Trends data. This consolidation used the date as a primary key for merging datasets.
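A sketch of this resampling and merging with pandas is shown below; the frame and column names (tweets, btc, trends, compound) are assumptions carried over from the earlier sketches:

```python
import pandas as pd

# `tweets` is assumed to have a minute-level DatetimeIndex with a 'text' column
# and a 'compound' sentiment score; `btc` and `trends` are hourly, datetime-indexed.
hourly_sentiment = tweets.resample('1H').agg({'text': 'count', 'compound': 'mean'})
hourly_sentiment = hourly_sentiment.rename(
    columns={'text': 'tweet_volume', 'compound': 'mean_compound'})

# Merge all sources on the hourly timestamp, which serves as the primary key.
merged = btc.join(hourly_sentiment, how='inner').join(trends, how='inner')
```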
Correlation analysis was conducted to understand the relationships between variables, using the Pearson correlation coefficient to quantify the strength and direction of these relationships [7]. Feature selection was then applied to refine the model by identifying and retaining only those features with significant predictive power, thereby enhancing computational efficiency and reducing the risk of overfitting [5]. Various feature selection techniques, including filter and wrapper methods, were explored to determine the most effective features for predicting Bitcoin prices. The F-test, a statistical filtering approach, was particularly noteworthy for its ability to rank features based on their individual contributions to the model's performance, using F-statistics and p-values to highlight significant attributes [5]. The SelectKBest function ranks features by their F-scores from highest to lowest and returns the most significant ones, i.e., those most strongly associated with the target variable.
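As an illustration, SelectKBest with the regression F-test could be applied as follows; X, y and k are assumed inputs (the candidate feature matrix, the target price series, and the number of features to keep):

```python
from sklearn.feature_selection import SelectKBest, f_regression

# X: candidate predictors (prices, tweet volume, sentiment, trends); y: target price.
selector = SelectKBest(score_func=f_regression, k=5)   # k = 5 is illustrative
X_selected = selector.fit_transform(X, y)

# Features ranked by F-score, from most to least associated with the target.
ranking = sorted(zip(X.columns, selector.scores_),
                 key=lambda pair: pair[1], reverse=True)
print(ranking)
```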
SHAP (Shapley Additive exPlanations) values emerge from game theory and offer a method to attribute the contribution of each feature in a predictive model to the overall prediction outcome [6]. Originating from the work of Lloyd Shapley, these values address the fundamental question: "How to fairly distribute the 'payout' (or contribution to the outcome) among the 'players' (or features) in a coalition (or model)?" [6]. The fairness criterion is assessed by considering all possible combinations or coalitions of features both with and without a particular feature, to determine its marginal contribution to the prediction. In practice, this involves systematically omitting each feature from the model, measuring the change in prediction accuracy, and then averaging this impact over all possible feature combinations. This average marginal contribution of a feature to the prediction accuracy of the model is its Shapley value [6].
For machine learning models, SHAP values offer a granular explanation of how each feature contributes to individual predictions, thereby enhancing model transparency and interpretability. They quantify the additive impact of including a specific feature (Xi) in the prediction by comparing the model's performance with and without that feature across all possible feature subsets. The SHAP value of (Xi) is thus the weighted average of the differences in prediction outcomes with and without (Xi), reflecting its contribution to the model's predictive power [6]. This methodological approach not only enhances our understanding of model behavior but also aids in identifying features that significantly influence model predictions, contributing to more informed decision-making and model improvement.
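In symbols, using the standard formulation consistent with this description, where F is the full feature set and f_S denotes the model restricted to a subset S of features, the Shapley value of feature Xi is:

```latex
\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!}
  \left[ f_{S \cup \{i\}}\!\left(x_{S \cup \{i\}}\right) - f_{S}\!\left(x_{S}\right) \right]
```

The combinatorial weight is the fraction of feature orderings in which Xi joins the coalition S, so the sum is exactly the weighted average of marginal contributions described above.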
In Python, the SHAP library implements Tree SHAP, specifically designed for decision tree-based models, including ensemble techniques like Random Forest and Gradient Boosted Trees. Tree SHAP optimizes the computation of Shapley values by focusing on a subset of features and fitting the data to produce approximations of these values, thus offering a practical solution to the otherwise computationally demanding task of calculating exact Shapley values across all feature permutations [13].
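A minimal sketch of this workflow, assuming a fitted tree-based model and a feature matrix X from the merged hourly dataset:

```python
import shap
from sklearn.ensemble import RandomForestRegressor

# X (hourly features) and y (target price) are assumed from the merged dataset.
model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)        # Tree SHAP for tree-based models
shap_values = explainer.shap_values(X)       # one SHAP value per feature per sample

# Global view: mean absolute SHAP value per feature as an importance ranking.
shap.summary_plot(shap_values, X, plot_type='bar')
```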
Feature Selection Algorithm
The feature selection process utilizes tree-based algorithms, notably CART (Classification and Regression Trees). In this context, the Random Forest model, an ensemble approach, aggregates predictions from multiple decision trees to enhance prediction accuracy and reliability [12, 14].
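For illustration, impurity-based importances from such an ensemble can be extracted as follows, again assuming X and y from the merged hourly dataset:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit an ensemble of CART trees and average their impurity-based importances.
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```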
Modeling with LSTM
LSTM (Long Short-Term Memory) networks, an advanced variant of Recurrent Neural Networks (RNNs), are adept at capturing long-term dependencies in sequence data [15]. Unlike traditional ANNs and RNNs, which may struggle with long sequences due to short-term memory limitations, LSTMs incorporate a system of gates that manage information flow, thereby mitigating these issues.
LSTMs are characterized by three types of gates:
- Forget Gate: Utilizes a sigmoid function to decide which information from the cell state should be retained or discarded, with values close to 0 indicating "forget" and values near 1 signifying "retain" [16].
- Input Gate: Determines updates to the cell state by combining the sigmoid function's output (which filters information) with the output of a tanh function (which scales data between −1 and 1), effectively deciding new information to be added to the cell state [16].
- Output Gate: Dictates the next hidden state by filtering the cell state through a tanh function, scaled by the sigmoid function's output, thus determining the portion of the cell state to contribute to the output [16].
These gates interact within the LSTM architecture to maintain a cell state that carries relevant information through the sequence, enabling the network to make informed predictions based on both recent and long-term past information. The LSTM unit processes the current input (Xt) together with the previous cell state (Ct−1) and the output of the previous LSTM unit (Ht−1) to update its gates and cell state, ultimately producing the new cell state (Ct) and the output or next hidden state (Ht).
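In this notation, a standard formulation of the gate and state updates (with σ the sigmoid function and ⊙ elementwise multiplication) is:

```latex
\begin{aligned}
F_t &= \sigma\!\left(W_F\,[H_{t-1}, X_t] + b_F\right) && \text{(forget gate)}\\
I_t &= \sigma\!\left(W_I\,[H_{t-1}, X_t] + b_I\right) && \text{(input gate)}\\
\tilde{C}_t &= \tanh\!\left(W_C\,[H_{t-1}, X_t] + b_C\right) && \text{(candidate cell state)}\\
C_t &= F_t \odot C_{t-1} + I_t \odot \tilde{C}_t && \text{(new cell state)}\\
O_t &= \sigma\!\left(W_O\,[H_{t-1}, X_t] + b_O\right) && \text{(output gate)}\\
H_t &= O_t \odot \tanh\!\left(C_t\right) && \text{(new hidden state)}
\end{aligned}
```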
Activation functions within the LSTM gates, specifically the sigmoid (sig) and tanh functions, play crucial roles in regulating the information flow, ensuring that the LSTM can learn and remember over long sequences effectively (Fig. 2) [17].
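A compact sketch of such a model in Keras is given below; the layer width, lookback window, feature count and training hyperparameters are illustrative assumptions, not the study's tuned configuration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

lookback, n_features = 24, 6   # illustrative: 24 hourly steps, 6 selected features

model = Sequential([
    LSTM(64, input_shape=(lookback, n_features)),  # gates use sigmoid/tanh internally
    Dense(1),                                      # next-hour closing price
])
model.compile(optimizer='adam', loss='mse')

# X_train: (n_samples, lookback, n_features), y_train: (n_samples,), assumed prepared.
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
```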