This section discusses the dataset used for training and testing the three models, as well as the research methods and materials of the paper.

**Dataset**: the SpamBase dataset [9] is used for training and testing the three employed models. This dataset consists of 4601 instances of spam and non-spam emails. A 50:50 learning scheme is adopted, in which 50% of the instances are used for training and the remaining 50% for testing.
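The 50:50 learning scheme can be sketched as follows. This is a minimal illustration, not the paper's code: the `split_50_50` helper, the seed, and the toy stand-in data are assumptions.

```python
import random

def split_50_50(instances, seed=42):
    """Shuffle the instances and split them into equal halves,
    mirroring the 50:50 train/test learning scheme described above."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = instances[:]            # copy so the original order is untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Toy stand-in for the 4601 SpamBase instances: (feature_vector, label) pairs,
# with label 1 for spam and 0 for non-spam.
data = [([float(i)], i % 2) for i in range(10)]
train_set, test_set = split_50_50(data)
```

With 4601 instances an exact 50:50 split leaves one extra instance on one side; how the paper assigns that odd instance is not stated.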

Table 1 shows a sample of some spam and non-spam instances of the dataset. Note that the two classes are labeled as “1” for spam and “0” for non-spam.

Table 1

A sample of spam and non-spam instances of the used dataset

| Sample number | Class | Content |
|---|---|---|
| 1 | Spam | Sunshine Quiz! Win a super Sony DVD recorder if you can name the capital of Australia? Text MQUIZ to 82277. B |
| 2 | Ham | As I entered my cabin my PA said, ''Happy B'day Boss!!''. I felt special. She askd me 4 lunch. After lunch she invited me to her apartment. We went there. |
| 3 | Spam | Today's Voda numbers ending with 7634 are selected to receive a £350 reward. If you have a match, please call 08712300220 quoting claim code 7684. Standard rates apply. |

**Naïve Bayes (NB):** Naïve Bayes is a probabilistic machine learning algorithm for binary or multiclass classification tasks. The algorithm is based on Bayes' theorem [14] and works by assuming that the occurrence of a given feature is independent of the occurrence of the other features. Bayes' theorem is used to determine the probability of a hypothesis given prior knowledge.

The working formula of Bayes' theorem is:

$$P\left(A|B\right)= \frac{P\left(B|A\right)P\left(A\right)}{P\left(B\right)} \tag{1}$$

Where \(P\left(A|B\right)\) is the probability of hypothesis A given that B is true, \(P\left(B|A\right)\) is the likelihood of B given that A is true, and P(A) and P(B) are the prior probabilities of A and B, respectively.
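Eq. (1) translates directly into code. In the sketch below the function name and the example probabilities (a hypothetical spam prior and word likelihood) are illustrative assumptions, not values from the paper.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Eq. (1): P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(spam) = 0.4, P("win" | spam) = 0.2, P("win") = 0.1.
# Posterior probability that a message containing "win" is spam:
posterior = bayes_posterior(0.2, 0.4, 0.1)   # → 0.8
```

The naïve independence assumption lets a full classifier multiply such per-feature likelihoods together instead of modelling their joint distribution.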

**K-Nearest Neighbor (KNN)**

K-nearest neighbors is a simple algorithm that stores all available instances and predicts the class of a new case based on a distance measure (e.g., the Euclidean distance) [15]. A new case is classified by a majority vote of its k nearest neighbors, which are found by computing its distance to every stored instance. Different distance measures can be used to compute the distance; however, in this work, the Euclidean distance is used, defined as

$$d\left(x,y\right)=\sqrt{\sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}} \tag{2}$$

Where x and y are two n-dimensional feature vectors.
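The distance computation and majority vote described above can be sketched in a few lines. The helper names and the toy two-dimensional training set are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Two toy clusters: class 0 near the origin, class 1 near (5, 5).
train = [([0.0, 0.0], 0), ([0.1, 0.2], 0), ([0.2, 0.1], 0),
         ([5.0, 5.0], 1), ([5.1, 4.9], 1)]
prediction = knn_predict(train, [0.05, 0.05])   # → 0 (all 3 nearest are class 0)
```

KNN does no training in the usual sense; all cost is paid at prediction time, which is why it is often called a lazy learner.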

**Support Vector Machine (SVM)**

SVM is a machine learning algorithm that can be used for both classification and regression problems. It works by finding a hyperplane in N-dimensional space that separates the data points of the different classes [16]. The idea is to find the hyperplane with the maximum margin, in other words, the hyperplane whose distance to the closest data points of each class is the largest. The SVM maximizes this margin by minimizing a cost function, i.e., the Hinge loss. Hinge loss, Exponential loss, Logit loss and many other types of loss can be used to train the SVM. However, in this work, the Hinge loss is used, defined as

$$\ell\left(y,f\left(x\right)\right)=\max\left(0,\; 1-y\,f\left(x\right)\right) \tag{3}$$

Where \(y\in\{-1,+1\}\) is the true class label and \(f\left(x\right)\) is the classifier's raw output for the input x.
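The standard hinge loss, max(0, 1 − y·f(x)), can be sketched as below. The function name and the example scores are illustrative assumptions; the paper does not give its own implementation.

```python
def hinge_loss(y_true, raw_score):
    """Hinge loss for a label y in {-1, +1} and a raw classifier score f(x).
    Zero when the point is correctly classified with margin >= 1,
    and grows linearly as the point moves to the wrong side."""
    return max(0.0, 1.0 - y_true * raw_score)

loss_confident = hinge_loss(+1, 2.0)   # → 0.0 (correct, outside the margin)
loss_margin    = hinge_loss(+1, 0.3)   # → 0.7 (correct, but inside the margin)
loss_wrong     = hinge_loss(-1, 0.3)   # → 1.3 (misclassified)
```

Because the loss stays zero for confidently correct points, minimizing it pushes the hyperplane to keep a margin of at least 1 from every training point.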

**Metrics for evaluating the models' performance**

Several metrics are used in this work to evaluate the performance of the three employed models in classifying email messages. These metrics include accuracy, sensitivity, specificity, and the area under the ROC curve (AUC).

$$Accuracy \left(Acc\right)=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FN}+\text{TN}+\text{FP}} \tag{4}$$

$$Sensitivity \left(Sn\right)=\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{5}$$

$$Specificity \left(Sp\right)=\frac{\text{TN}}{\text{TN}+\text{FP}} \tag{6}$$

Where TP stands for true positive, the number of correctly predicted positive classes, and TN stands for true negative, the number of correctly predicted negative classes. FP is the false positive count, the number of negative data incorrectly predicted as positive, while FN is the false negative count, the number of positive data incorrectly predicted as negative. AUC is the area under the Receiver Operating Characteristic (ROC) curve, a graph that shows the performance of the classifier at different decision thresholds by plotting the true positive rate against the false positive rate.
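The three metrics follow directly from the four confusion-matrix counts. The sketch below is illustrative: the function name and the toy label vectors are assumptions, with spam treated as the positive class (label 1).

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity from the
    confusion-matrix counts (positive class = 1, i.e. spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / (tp + tn + fp + fn)   # Eq. (4)
    sn = tp / (tp + fn)                     # Eq. (5)
    sp = tn / (tn + fp)                     # Eq. (6)
    return acc, sn, sp

# Toy example: 8 test messages, 1 missed spam (FN) and 1 false alarm (FP).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
acc, sn, sp = classification_metrics(y_true, y_pred)   # → (0.75, 0.75, 0.75)
```

In the spam-filtering setting, sensitivity measures how much spam is caught, while specificity measures how many legitimate (ham) messages survive the filter.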