Python is the primary programming language used for natural language processing in this study. Recent studies have chosen CNNs, LSTMs, and RNNs as their deep learning architectures. The Keras and TensorFlow libraries, along with conventional scikit-learn NLP methods, are employed in this study [4]. The results and ROC AUC score are measured on a relative scale to determine the maximum detectability of state-sponsored propaganda [5].
All known tweets identified by U.S. intelligence agencies from the 2016 presidential election cycle were analyzed [6]. A machine learning algorithm was written and trained on existing curated rumor detection datasets.
The ROC analysis was based on the vectorized tweet content. If the AUC was insufficient, metadata such as the publisher, timestamp, and network medium were also considered. The resulting AUC indicated how well the vectorized text predicted the truthfulness of a given tweet.
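The vectorize-then-score step described above can be sketched as follows. This is a minimal illustration using toy statements and a logistic regression stand-in, not the actual tweet data or the classifier configuration used in this study:

```python
# Sketch: vectorize short texts, fit a classifier, and compute ROC AUC.
# The texts and labels below are toy placeholders, not the study's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

texts = [
    "officials confirmed the report today",
    "shocking secret they do not want you to see",
    "the committee released its findings",
    "unbelievable claim spreads across the network",
]
labels = [0, 1, 0, 1]  # 1 = false/misinformation, 0 = true

# Vectorize the tweet content (TF-IDF shown here as one common choice).
X = TfidfVectorizer().fit_transform(texts)

# Fit a simple classifier and score how well the vectorized text
# predicts truthfulness, summarized as ROC AUC.
clf = LogisticRegression().fit(X, labels)
scores = clf.predict_proba(X)[:, 1]
auc = roc_auc_score(labels, scores)
```

If the AUC from text alone proved insufficient, metadata columns (publisher, timestamp, network medium) could be appended to the feature matrix before refitting.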
More than 1.2 million Russian misinformation tweets were used to test the sentiment analysis algorithm, which was trained on known reputable curated rumor detection datasets [7]. Wang presents the LIAR dataset, cited in previous works reviewed in this dissertation. The LIAR dataset curators collected 10 years of manually labeled short pieces of text from the website PolitiFact.com. PolitiFact provides detailed sourcing and labeling for every specific case determination. The LIAR dataset curation team designed a hybrid, surface-level linguistic neural network model that integrates the metadata described with the associated text [8].
The necessary data were obtained by reviewing the work of Darren L. Linvill and Patrick L. Warren, which they explained as follows: Our research employed a data set of 9.03 million tweets released by Twitter on October 17, 2018 (Gadde & Roth, 2018). These tweets came from 3,661 accounts, which are a subset of the 3,841 accounts given by Twitter to Congress. A list of these account handles was released on June 18, 2018 by the U.S. House Intelligence Committee (Permanent Select Committee on Intelligence, 2018). The Twitter release included hashed/de-identified versions of account handles for accounts with fewer than 5000 followers. We used an alternate version of the Tweets we collected for an earlier draft of this project to re-identify most of the accounts.
Linvill and Warren distilled the dataset down to 3 million tweets and added features describing the political leanings of tweet authors within the dataset [6]. Their dataset was selected for this research because the additional features were useful for understanding the context of the data. The news organization FiveThirtyEight made the large dataset publicly available in a GitHub repository for researchers and analysts. The researchers’ footnote read “add data update from Clemson U. researchers [7].”
The researchers employed a combination of quantitative and qualitative methods to compile this data. To interpret and summarize emergent themes within the body of text, they conducted axial coding [7].
After training the support vector classifier on the LIAR dataset, the testing set was replaced with the IRA tweet data [7, 8]. To allow evaluation of the classifier’s predictions, a new column was created within the data. This column contained a true/false category, enumerated as 0 or 1, and all known IRA tweets were set to “1,” i.e., false. Internet Research Agency tweets are classified as false based on the following assumptions:
1. The list of IRA Twitter handles submitted to the U.S. House Intelligence Committee in 2018 was correct.
2. Tweets associated with the IRA authors are captured and represented within the dataset published by Linvill and Warren.
3. All tweets published by the IRA are misinformation. Since the IRA mounted an active and well-documented misinformation campaign, this research assumes that all information published by them was not credible.
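The labeling step above can be sketched with pandas. The DataFrame below is a toy stand-in for the real IRA tweet table, and the column names are illustrative assumptions rather than the actual schema:

```python
# Sketch: add the binary truth label to the IRA tweet data.
# The DataFrame and its column names are placeholders, not the real schema.
import pandas as pd

tweets = pd.DataFrame({
    "content": ["example tweet a", "example tweet b"],
    "author": ["ira_handle_1", "ira_handle_2"],
})

# Under assumption 3, every known IRA tweet is labeled 1 ("false").
tweets["label"] = 1
```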
The machine learning models were trained using the LIAR dataset to detect misinformation [8]. The models tested were selected based on the research of Ries et al., as outlined in Section 3.2. When Ries et al. used supervised learning for fake news detection [8], they calculated the following summary statistics for each model:
“Classifier” indicates the machine learning model used, and F1 represents the F1 score, the harmonic mean of the precision and recall achieved by the classifier. First developed by Fix and Hodges in 1951, the K-Nearest Neighbors (KNN) algorithm [9] classifies data points using the predefined labels of nearby training examples in supervised learning [10]. When making a prediction, KNN finds the k labeled training points nearest to a new data point and assigns the point the majority class among those neighbors.
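The KNN prediction rule described above can be illustrated on toy two-dimensional points (not the tweet features used in this study):

```python
# Illustrative KNN classification on toy 2-D points.
from sklearn.neighbors import KNeighborsClassifier

# Two small, well-separated groups of labeled training points.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

# With k=3, each new point takes the majority label of its
# three nearest training neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])  # nearest to group 0, group 1
```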