This research employs opinion mining techniques using the Naïve Bayes Classifier (NBC) and Gradient Boosted Machines (GBM) to analyze sentiments expressed on Twitter regarding the Constitutional Court’s decision on the 2024 Indonesian Presidential Election. The NBC method is a probabilistic classifier known for its simplicity and effectiveness in text classification; it categorizes text based on the likelihood of each sentiment class given the linguistic features present in the text. Concurrently, GBM enhances prediction accuracy by iteratively building models that focus on the cases misclassified by previous iterations.
The following stages were carried out in the sentiment analysis for this study:
1. Data Collection
The process of collecting data relevant to the research. In this context, it involves gathering tweets from Platform X (formerly Twitter) related to the research topic.
2. Dataset
The dataset is a collection of Platform X (Twitter) comments that use hashtags related to the Constitutional Court's decision and that trended during the five days before the Court delivered its final decision on the Dispute over the Results of the 2024 Presidential and Vice Presidential Elections on April 22, 2024. It covers the trending-topic queries BarengPrabowoBangunBangsa, Gibran, Constitutional Court, Constitutional Court Verdict, PeopleWin, Anies Muhaimin, and Ganjar Mahfud. A total of 11,721 records were retrieved, each containing username, tweet ID, date, tweet text content, and source.
3. Data Pre-processing
The stage of preparing the raw text for analysis by cleaning it of elements that are unnecessary or provide no useful information. The steps include:
1. Case Folding: Normalizing all text to lowercase to ensure uniformity.
2. Filtering: Removing non-essential characters like numerals and symbols to focus purely on textual data.
3. Stopword Removal: Eliminating common words that offer minimal unique information for analysis.
4. Tokenizing: Segmenting text into individual words or tokens.
5. Stemming: Reducing words to their base or root form to simplify the analysis.
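The five preprocessing steps above can be sketched as a single pipeline. This is an illustrative sketch only: the stopword list is a tiny sample rather than a full Indonesian dictionary, and the suffix-stripping rule is a hypothetical stand-in for a real Indonesian stemmer; the study itself performed these steps with RapidMiner operators.

```python
import re

# Illustrative sample only; a real pipeline would use a full
# Indonesian stopword dictionary.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}

def preprocess(text):
    # 1. Case folding: normalize all text to lowercase
    text = text.lower()
    # 2. Filtering: drop URLs, mentions, numerals, and symbols
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    # 3./4. Tokenizing with stopword removal
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # 5. Stemming: naive suffix stripping, a hypothetical stand-in
    #    for a proper Indonesian stemmer
    return [re.sub(r"(nya|kan|an)$", "", t) for t in tokens]

print(preprocess("Putusan MK ini dan itu dibacakan tgl 22!"))
# → ['putus', 'mk', 'dibaca', 'tgl']
```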
4. Labelling
The process of assigning a category or class label to each record based on specific criteria. In this study, each text was labeled positive, negative, or neutral, either manually or with pre-trained algorithms.
5. TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is employed to transform text into a vector format. This statistical measure evaluates the importance of a word within a corpus, providing a weighting factor that enhances the effectiveness of the Naïve Bayes model in distinguishing relevant terms in tweets.
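The weighting can be made concrete with a small pure-Python sketch of one common TF-IDF variant (raw term frequency normalized by document length, IDF as log(N/df)); RapidMiner's TF-IDF operator may differ in normalization details.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents.
    tf = count / doc length; idf = log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each doc counts a term once
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["putusan", "mk", "final"],
        ["putusan", "sengketa", "pilpres"],
        ["mk", "tolak", "gugatan"]]
w = tf_idf(docs)
# "putusan" occurs in 2 of 3 documents, so it is weighted lower
# than "final", which occurs in only 1.
```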
6. Model Training
a. Naive Bayes Model Training
The process of training the Naive Bayes classification algorithm on the labeled dataset. The model uses probability and statistics to predict sentiment labels from text, and is trained with the dataset that was given sentiment labels in the previous step.
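The probabilistic idea can be sketched as a minimal multinomial Naive Bayes with Laplace smoothing (an illustrative implementation, not the RapidMiner operator; the training examples below are invented for the sketch):

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)           # class frequencies
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        v, n = len(self.vocab), sum(self.priors.values())
        for c in self.classes:
            # log P(c) + sum of log P(word | c) with add-one smoothing
            lp = math.log(self.priors[c] / n)
            total = sum(self.word_counts[c].values())
            for w in doc:
                lp += math.log((self.word_counts[c][w] + 1) / (total + v))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

train_docs = [["menang", "rakyat"], ["tolak", "curang"], ["putusan", "adil"]]
train_y = ["positive", "negative", "positive"]
nb = MultinomialNB().fit(train_docs, train_y)
print(nb.predict(["rakyat", "menang"]))  # → positive
```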
b. GBM Model Training
The process of training a Gradient Boosted Machines model, an ensemble classification algorithm that uses boosting techniques to generate its predictive model. GBM builds a series of weak predictive models incrementally, with each new model attempting to correct the errors of the previous ones.
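The boosting principle can be illustrated with a minimal sketch: each stage fits a weak learner (here a one-split "stump") to the residual errors of the ensemble so far. Real GBM implementations, such as RapidMiner's Gradient Boosted Trees, use loss gradients, shrinkage schedules, and full decision trees; the toy data below is invented for the illustration.

```python
def fit_stump(xs, residuals):
    # Pick the threshold minimizing squared error of a two-leaf fit.
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, lr=0.5):
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Each new weak learner targets the current residual errors.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]
model = boost(xs, ys)  # ensemble converges toward the step function
```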
7. Model Evaluation
The process of testing the trained model against a test dataset to assess its performance using metrics such as accuracy, precision, recall, and F1 score, obtained through the "performance" and "cross validation" operators. These metrics indicate how well the model can classify new data and how closely its predictions correspond to actual values.
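These metrics follow directly from the confusion-matrix counts; a sketch for one class treated as "positive" in a one-vs-rest fashion (multi-class scores average these per class; the label vectors below are invented for illustration):

```python
def evaluate(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 with one class as 'positive'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

y_true = ["pos", "pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neu", "pos", "pos"]
acc, prec, rec, f1 = evaluate(y_true, y_pred, "pos")
# 4 of 6 labels match; 2 true positives, 1 false positive, 1 false negative
```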
8. Analysis of Results
This process examines the classification results, looking for relationships and patterns in the data predicted by the model.
9. Analysis Tools
This study used RapidMiner as the main tool for data analysis. RapidMiner is an advanced yet user-friendly data analysis platform that allows users to perform various analytical processes, including data preprocessing, modeling, and model evaluation, without in-depth programming knowledge [24].