In the proposed study, machine learning models are trained to detect fake news. The workflow of the proposed approach, shown in Fig. 2, is divided into several phases, including data pre-processing, feature selection, and model training. The steps involved are outlined below:
Step 1: Data collection: The LIAR dataset is collected from Kaggle.
Step 2: Data preprocessing: Techniques such as tokenization, stemming, and removal of punctuation marks and stop words are applied to clean the dataset, eliminate features that are not needed, and handle ambiguous data.
Step 3: Feature selection: After cleaning the dataset, the next step is vectorization of the data. TF-IDF is applied to convert the text data into numerical vectors.
Step 4: Training and testing of machine learning algorithms.
Step 5: Evaluation of the trained model using performance metrics.
Step 6: Finally, the best-performing model is chosen on the basis of the evaluation metrics.
Dataset
The LIAR dataset is a publicly available dataset of fact-checked political statements. It comprises 10,240 statements. The dataset, obtained from Kaggle, contains two columns: a “statement” column and a label that takes one of six classes: “true”, “half-true”, “mostly true”, “barely true”, “false”, and “pants on fire”.
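As an illustrative sketch, the dataset can be loaded and inspected with pandas; the file name liar.csv and the column names “statement” and “label” are assumptions here and may differ in the actual Kaggle release:

```python
import pandas as pd

# "liar.csv" and the column names are assumptions; adjust to the actual Kaggle files.
df = pd.read_csv("liar.csv")

print(df.shape)                     # expected (10240, 2) per the description above
print(df["label"].value_counts())  # distribution over the six truthfulness classes
```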
Data preprocessing
Pre-processing was done using the NLTK toolkit, an open-source and widely used NLP library. It comes with built-in functions and algorithms such as the nltk.tokenize module (for tokenizing text) and the nltk.stem.porter.PorterStemmer class (the popular Porter stemming algorithm). This phase transforms the raw text into a format suitable for further processing. The input dataset is pre-processed in several steps to remove noise: lowercasing, removal of punctuation and special characters, tokenization, stop-word removal, and stemming.
The pre-processing of the dataset consists of the following steps.
Tokenization is the process of breaking text down into smaller units, or tokens, and it is typically the first step in data cleaning for Natural Language Processing (NLP) projects. NLTK, which stands for Natural Language Toolkit, is a collection of libraries, and its nltk.tokenize module can be used for tokenization.
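A minimal sketch of this step with NLTK, using a hypothetical example sentence:

```python
import nltk
from nltk.tokenize import word_tokenize

# One-time download of the tokenizer models
# (newer NLTK versions may instead require the "punkt_tab" resource).
nltk.download("punkt")

text = "BREAKING: Unemployment fell to 3.5% last quarter."
tokens = word_tokenize(text.lower())  # lowercase first, then split into tokens
print(tokens)
# ['breaking', ':', 'unemployment', 'fell', 'to', '3.5', '%', 'last', 'quarter', '.']
```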
Stop words are insignificant words that add little to a sentence's meaning or tell us little about the data; they create noise, and some can mislead the models rather than help them distinguish true from fake news, so they are removed. Articles, prepositions, conjunctions, and some pronouns are considered stop words. Commonly used stop words include the, of, I, you, it, and, a, about, an, are, as, at, be, by, for, from, how, in, is, on, or, that, these, this, too, was, what, when, where, who, will, etc. To eliminate stop words from a sentence, the text is split into words and each word is checked against the NLTK stop-word list; if a word appears in that list, it is removed.
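A sketch of stop-word removal against NLTK's English stop-word list, continuing the same approach:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")  # one-time download of the stop-word lists

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("the economy is not what it was a decade ago".lower())

# Keep only the tokens that are not in the stop-word list.
filtered = [tok for tok in tokens if tok not in stop_words]
print(filtered)  # ['economy', 'decade', 'ago']
```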
Stemming helps to reduce words to their root form; for example, “running” is reduced to the root “run” and “cars” to “car”. For this purpose, the Porter stemmer, the most commonly used stemming algorithm, is employed. After preprocessing, the cleaned data must be converted into a numerical format for further tasks.
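A sketch of the Porter stemmer via NLTK; note that stems need not be dictionary words:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "cars", "studies", "arguing"]:
    print(word, "->", stemmer.stem(word))
# running -> run, cars -> car, studies -> studi, arguing -> argu
```

Outputs such as “studi” and “argu” are valid stems even though they are not words themselves; what matters is that inflected variants collapse onto the same feature.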
Feature selection
The preprocessed data is passed to the vectorization process, which includes both the CountVectorizer and term frequency-inverse document frequency (TF-IDF) techniques. Both are used for text preprocessing in natural language processing tasks such as text classification, and both convert the text data into a numerical format. In this work, TF-IDF served as the primary feature selection method.
CountVectorizer is a useful tool in NLP for converting text data into token counts: it determines how often each word appears in the text and records a count for each word, creating a numerical representation of the text data.
For example, in the sentence “The dog in the hat sat on the floor”, the word “the” is counted three times and every other word once.
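A brief sketch of this on the example sentence, assuming scikit-learn (whose CountVectorizer and TfidfVectorizer classes match the technique names used here):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(["The dog in the hat sat on the floor"])

# Vocabulary learned from the sentence, paired with per-word counts.
print(dict(zip(vectorizer.get_feature_names_out(), counts.toarray()[0])))
# {'dog': 1, 'floor': 1, 'hat': 1, 'in': 1, 'on': 1, 'sat': 1, 'the': 3}
```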
TF-IDF consists of two parts: term frequency and inverse document frequency. It is a statistical measure that evaluates how important a word is to a given document within a corpus, and it is used to identify the most important terms or keywords in a document or set of documents. This is useful for many natural language processing tasks, such as text classification, information retrieval, and content analysis: using TF-IDF, the most relevant and informative terms can be extracted automatically from a large collection of documents, making the content easier to understand and analyze. TF-IDF is the product of term frequency and inverse document frequency. Term frequency measures how often a term occurs in a particular document, i.e., the ratio of the number of times a word appears in a document to the total number of words in that document, while inverse document frequency measures how rare or unique the term is across all documents in the dataset.
TF = (number of times the word appears in a document) / (total number of words in the document)
IDF = log(total number of documents / number of documents that contain the word)
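As a brief worked example: if a term occurs 3 times in a 100-word document, TF = 3/100 = 0.03; if the term appears in 10 of 1,000 documents in the corpus, IDF = log(1000/10) = log(100) = 2 using base-10 logarithms, giving a TF-IDF score of 0.03 × 2 = 0.06. (Implementations vary slightly; for instance, scikit-learn's TfidfVectorizer uses a smoothed natural-log variant of the IDF formula.)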
The TF-IDF score of a term is calculated by multiplying its term frequency by its inverse document frequency; the higher the TF-IDF score of a term in a document, the more important the term is in that document. After the textual data is converted into numerical form, the machine learning models are trained on the numerical vectors and used to predict the labels of the test dataset. Naive Bayes, Logistic Regression, and Support Vector Machine algorithms are used to train the models on the dataset, as sketched below.
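A condensed sketch of this training stage, assuming scikit-learn and the dataframe df from the loading sketch above (its column names remain assumptions). Default hyperparameters are shown rather than the study's tuned values, and LinearSVC stands in as one common SVM formulation for text, since the exact SVM configuration is not specified here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# texts: preprocessed statements; labels: the six truthfulness classes.
texts, labels = df["statement"], df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# Fit TF-IDF on the training split only, then transform both splits.
tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": LinearSVC(),
}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    print(name, "test accuracy:", model.score(X_test_vec, y_test))
```

Fitting the vectorizer on the training split alone avoids leaking test-set vocabulary statistics into training.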
Evaluation metrics
Accuracy: Accuracy is the ratio of the number of correct predictions to the total number of instances: Accuracy = (TP + TN) / (TP + TN + FP + FN).
Precision: Precision is the ratio of the number of correctly predicted positive instances to the total number of predicted positive instances: Precision = TP / (TP + FP).
Recall: Recall is the ratio of the number of correctly predicted positive instances to the total number of actual positive instances: Recall = TP / (TP + FN).
F1-score: The F1 measure is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
Here, TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.
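Continuing the sketch above, these metrics can be computed with scikit-learn; since the LIAR labels span six classes, precision, recall, and F1 must be averaged across classes, and macro averaging is shown here as one common choice:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_test and X_test_vec come from the training sketch; model is one trained classifier.
y_pred = model.predict(X_test_vec)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro")  # macro-average across the six classes
print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  f1={f1:.3f}")
```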