This section discusses the dataset used for training and testing the three models, as well as the research methods and materials of the paper.

**Dataset**: the SpamBase dataset [9] is used for training and testing the three employed models. This dataset consists of 4601 instances of spam and non-spam emails. A 50:50 learning scheme is adopted, in which 50% of the instances are used for training and the remaining 50% for testing.
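The 50:50 learning scheme can be sketched as follows. This is a minimal illustration, not the paper's code: the `split_50_50` helper, the seed, and the toy stand-in data are assumptions.

```python
import random

def split_50_50(instances, seed=42):
    """Shuffle the instances and split them into equal halves,
    mirroring the 50:50 train/test learning scheme described above."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = instances[:]            # copy so the original order is untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Toy stand-in for the 4601 SpamBase instances: (feature_vector, label) pairs,
# with label 1 for spam and 0 for non-spam.
data = [([float(i)], i % 2) for i in range(10)]
train_set, test_set = split_50_50(data)
```

With 4601 instances an exact 50:50 split leaves one extra instance on one side; how the paper assigns that odd instance is not stated.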

Table 1 shows a sample of some spam and non-spam instances of the dataset. Note that the two classes are labeled as “1” for spam and “0” for non-spam.

Table 1

A sample of spam and non-spam instances of the used dataset

| Sample number | Class | Content |
|---|---|---|
| 1 | Spam | Sunshine Quiz! Win a super Sony DVD recorder if you can name the capital of Australia? Text MQUIZ to 82277. B |
| 2 | Ham | As I entered my cabin my PA said, ''Happy B'day Boss!!''. I felt special. She askd me 4 lunch. After lunch she invited me to her apartment. We went there. |
| 3 | Spam | Today's Voda numbers ending with 7634 are selected to receive a £350 reward. If you have a match, please call 08712300220 quoting claim code 7684. Standard rates apply. |

**Naïve Bayes (NB):** Naïve Bayes is a probabilistic machine learning algorithm for binary or multiclass classification tasks. The algorithm is based on Bayes' theorem [14] and works by assuming that the occurrence of a given feature is independent of the occurrence of the other features. Bayes' theorem is used to determine the probability of a hypothesis given prior knowledge.

The working formula of Bayes' theorem is:

$$P\left(A|B\right)= \frac{P\left(B|A\right)P\left(A\right)}{P\left(B\right)} \tag{1}$$

Where \(P\left(A|B\right)\) is the probability of hypothesis A given that B is true, \(P\left(B|A\right)\) is the likelihood of B given that A is true, and P(A) and P(B) are the prior probabilities of A and B, respectively.
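Eq. (1) translates directly into code. In the sketch below the function name and the example probabilities (a hypothetical spam prior and word likelihood) are illustrative assumptions, not values from the paper.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Eq. (1): P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(spam) = 0.4, P("win" | spam) = 0.2, P("win") = 0.1.
# Posterior probability that a message containing "win" is spam:
posterior = bayes_posterior(0.2, 0.4, 0.1)   # → 0.8
```

The naïve independence assumption lets a full classifier multiply such per-feature likelihoods together instead of modelling their joint distribution.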

**K-Nearest Neighbor (KNN)**

K-nearest neighbors is a simple algorithm that stores all available instances and predicts the class of a new case based on a distance measure (e.g., the Euclidean distance) [15]. A new case is classified by a majority vote of its k nearest neighbors, which are found by computing its distance to every stored instance. Different distance measures can be used to compute the distance; however, in this work, the Euclidean distance is used, defined as

$$d\left(x,y\right)=\sqrt{\sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}} \tag{2}$$

Where x and y are two n-dimensional feature vectors.
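The distance computation and majority vote described above can be sketched in a few lines. The helper names and the toy two-dimensional training set are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Two toy clusters: class 0 near the origin, class 1 near (5, 5).
train = [([0.0, 0.0], 0), ([0.1, 0.2], 0), ([0.2, 0.1], 0),
         ([5.0, 5.0], 1), ([5.1, 4.9], 1)]
prediction = knn_predict(train, [0.05, 0.05])   # → 0 (all 3 nearest are class 0)
```

KNN does no training in the usual sense; all cost is paid at prediction time, which is why it is often called a lazy learner.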

**Support Vector Machine (SVM)**

SVM is a machine learning algorithm that can be used for both classification and regression problems. It works by finding a hyperplane in N-dimensional space that separates the data points of the different classes [16]. The idea is to find the hyperplane with the maximum margin, in other words, the hyperplane whose distance to the closest data points of each class is the largest. The SVM maximizes this margin by minimizing a cost function, i.e., the Hinge loss. Hinge loss, Exponential loss, Logit loss and many other types of loss can be used to train the SVM. However, in this work, the Hinge loss is used, defined as

$$\ell\left(y,f\left(x\right)\right)=\max\left(0,\; 1-y\,f\left(x\right)\right) \tag{3}$$

Where \(y\in\{-1,+1\}\) is the true class label and \(f\left(x\right)\) is the classifier's raw output for the input x.
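The standard hinge loss, max(0, 1 − y·f(x)), can be sketched as below. The function name and the example scores are illustrative assumptions; the paper does not give its own implementation.

```python
def hinge_loss(y_true, raw_score):
    """Hinge loss for a label y in {-1, +1} and a raw classifier score f(x).
    Zero when the point is correctly classified with margin >= 1,
    and grows linearly as the point moves to the wrong side."""
    return max(0.0, 1.0 - y_true * raw_score)

loss_confident = hinge_loss(+1, 2.0)   # → 0.0 (correct, outside the margin)
loss_margin    = hinge_loss(+1, 0.3)   # → 0.7 (correct, but inside the margin)
loss_wrong     = hinge_loss(-1, 0.3)   # → 1.3 (misclassified)
```

Because the loss stays zero for confidently correct points, minimizing it pushes the hyperplane to keep a margin of at least 1 from every training point.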

**Metrics for evaluating the models' performance**

Several metrics are used in this work to evaluate the performance of the three employed models in classifying email messages. These metrics include accuracy, sensitivity, specificity, and the area under the ROC curve (AUC).

$$Accuracy \left(Acc\right)=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FN}+\text{TN}+\text{FP}} \tag{4}$$

$$Sensitivity \left(Sn\right)=\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{5}$$

$$Specificity \left(Sp\right)=\frac{\text{TN}}{\text{TN}+\text{FP}} \tag{6}$$

Where TP stands for true positive, the number of correctly predicted positive classes, and TN stands for true negative, the number of correctly predicted negative classes. FP is the false positive count, the number of negative data incorrectly predicted as positive, while FN is the false negative count, the number of positive data incorrectly predicted as negative. AUC is the area under the Receiver Operating Characteristic (ROC) curve, a graph that shows the performance of the classifier at different decision thresholds by plotting the true positive rate against the false positive rate.
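The three metrics follow directly from the four confusion-matrix counts. The sketch below is illustrative: the function name and the toy label vectors are assumptions, with spam treated as the positive class (label 1).

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity from the
    confusion-matrix counts (positive class = 1, i.e. spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / (tp + tn + fp + fn)   # Eq. (4)
    sn = tp / (tp + fn)                     # Eq. (5)
    sp = tn / (tn + fp)                     # Eq. (6)
    return acc, sn, sp

# Toy example: 8 test messages, 1 missed spam (FN) and 1 false alarm (FP).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
acc, sn, sp = classification_metrics(y_true, y_pred)   # → (0.75, 0.75, 0.75)
```

In the spam-filtering setting, sensitivity measures how much spam is caught, while specificity measures how many legitimate (ham) messages survive the filter.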