Machine learning is a branch of artificial intelligence in which models are built from extracted data in order to predict future outcomes. The computer algorithm receives a set of instructions that enables it to understand the nature of the data. The core concept of machine learning is the design of algorithms that enable a machine to identify a set of data and classify it based on its attributes. Learning takes place on data extracted by the algorithm after some noise has been removed (Conway & White, 2012). Classification techniques help the learning algorithm to make effective decisions, and machine learning is capable of evaluating past and existing risks to improve future performance (Blum & Langley, 1997). Five major types of machine learning algorithms are usually employed in BYOD security implementations: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Each is briefly described below.
Supervised learning approaches can be used to detect threats and attacks in a BYOD environment and to create countermeasures. Supervised learning is the most widely used class of ML algorithms: the output is predicted from the input after the algorithm has been trained on labelled data. Supervised learning falls into two categories, classification and regression (Tahsien et al., 2020). In classification, the output is a fixed or categorical value, such as [Yes or No] or [True or False]. Examples of supervised classification algorithms include support vector machines, decision trees, random forests, k-nearest neighbor, association rules, and Bayesian classifiers. Regression, on the other hand, is a type of supervised learning in which the output is a continuous value that depends on the input variables. Examples of regression algorithms include neural networks, decision trees, and ensemble learning.
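To make the classification case concrete, the following is a minimal, illustrative sketch (not drawn from any of the reviewed studies) that trains a support vector machine on synthetic stand-ins for labelled BYOD traffic features; in practice the feature vectors would describe, for example, packet sizes or inter-arrival times, and the labels would mark records as benign or malicious.

```python
# Hypothetical sketch: supervised classification of BYOD traffic records.
# The synthetic data stand in for labelled feature vectors; labels are
# 0 = benign, 1 = malicious.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = SVC(kernel="rbf")       # support vector machine classifier
clf.fit(X_train, y_train)     # learn from the labelled training data
print("Test accuracy:", clf.score(X_test, y_test))
```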
Unsupervised learning is a type of learning algorithm employed for the analysis and categorization of complex data. In unsupervised learning there is no target output for a given input value: this type of learning does not require labelled data, but instead examines unlabelled data and groups it into clusters. Various unsupervised techniques have been employed for BYOD security, for example privacy protection using the infinite Gaussian mixture model (IGMM) and DoS-attack detection using multivariate correlation analysis (Tan et al., 2013). Examples of unsupervised learning algorithms include k-means clustering and principal component analysis.
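As a minimal illustration of clustering, the sketch below groups synthetic, unlabelled feature vectors with k-means; no labels are supplied, and the algorithm discovers the two groups on its own.

```python
# Hypothetical sketch: unsupervised clustering of unlabelled records.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose groups of 4-dimensional feature vectors, with no labels given.
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))
```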
Semi-supervised learning, on the other hand, combines supervised and unsupervised learning (Shah & Shankarappa, 2018). The semi-supervised algorithm therefore sits between the two, and is able to deal with datasets in which only some of the observations are labelled. In many practical circumstances the cost of labelling a dataset is quite high, since it requires human expert opinion. Thus, when only a few of the observations can be labelled, semi-supervised learning is a suitable algorithm for model construction (Hussain et al., 2020).
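The sketch below illustrates this idea with scikit-learn's SelfTrainingClassifier, where unlabelled observations are marked with -1; the data are synthetic, and the 10% labelling fraction is an arbitrary assumption for illustration.

```python
# Hypothetical sketch: semi-supervised learning with self-training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=1)

# Pretend labelling is expensive: keep labels for roughly 10% of observations.
y_partial = y.copy()
unlabelled = np.random.default_rng(1).random(len(y)) > 0.1
y_partial[unlabelled] = -1          # -1 marks an unlabelled observation

base = SVC(probability=True, random_state=1)
model = SelfTrainingClassifier(base).fit(X, y_partial)
print("Accuracy on all observations:", model.score(X, y))
```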
Reinforcement learning is a type of learning usually employed in gaming environments. In this form of learning, the algorithm learns by interacting with its environment (similar to human interaction), executing actions that increase its cumulative feedback (Mnih et al., 2015). The feedback is a reward that depends on the outcome of the performed task. In reinforcement learning there is no predefined action for any task; instead, the algorithm proceeds by trial and error. The learning agent can thus recognize and apply the best strategy from its experience to obtain the best reward.
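To illustrate the trial-and-error principle, the sketch below implements tabular Q-learning on a toy five-state corridor in which the agent is rewarded only for reaching the final state; the environment and reward are invented for illustration and are unrelated to any reviewed BYOD study.

```python
# Hypothetical sketch: tabular Q-learning by trial and error.
import numpy as np

n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  # value of each action in each state
alpha, gamma, eps = 0.5, 0.9, 0.2    # learning rate, discount, exploration

rng = np.random.default_rng(0)
for _ in range(500):                 # 500 episodes of trial and error
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("Learned greedy policy (0 = left, 1 = right):", Q.argmax(axis=1))
```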
Deep learning, a subset of machine learning, is another learning model usually employed in the implementation of BYOD security. Deep learning is a machine learning approach whose architecture is centered on artificial neural networks (ANNs). ANNs are learning algorithms inspired by the brain, although this does not imply that they work exactly like the biological brain. A neural network consists of neurons (variables) connected via weighted connections (parameters). The network can be trained with either a supervised or an unsupervised learning approach, using labelled or unlabelled data respectively, and learning proceeds by iteratively modifying the weights between every pair of neurons. Thus, when describing deep learning we refer to a larger neural network, where the term deep denotes the number of layers in the network (Yang et al., 2014). In the early days of artificial neural networks it was hard to train a network, even a relatively small one, because of constraints on computational power. Advances in technology, however, have brought more effective means of estimating the optimal network weights, such as graphics processing units (GPUs), which permit the construction of larger networks containing more hidden layers. Although it is not a strict rule, artificial neural networks containing more than one hidden layer are regarded as deep learning models. Deep learning models used in BYOD implementations include convolutional neural networks, recurrent neural networks, and autoencoders.
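A minimal sketch of a network with more than one hidden layer is given below, using scikit-learn's MLPClassifier on synthetic data; the layer sizes are arbitrary illustrative choices.

```python
# Hypothetical sketch: a multi-hidden-layer neural network classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Two hidden layers (32 and 16 neurons); weights are adjusted iteratively
# by backpropagation during fit().
net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=7)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))
```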
3.1 Review of Machine Learning Techniques for Detecting and Combating Security Threats and Attacks in the Bring Your Own Device (BYOD) Environment
This section provides a review of machine learning techniques for detecting and combating security threats and attacks in the BYOD environment. The review considers several aspects of BYOD machine learning implementations, such as the dataset used, the machine learning algorithm employed, and the performance measures adopted. Specifically, this section is organized into three subsections. Section 3.1.1 looks into the different datasets used in machine learning-based approaches to BYOD security threats and attacks. Section 3.1.2 discusses the various machine learning algorithms employed, while Section 3.1.3 reviews the different performance evaluation metrics considered by different authors to assess the performance of these implementations. This section follows the same approach used in (Eke et al., 2019). A summary of the review is given in Table 4.
3.1.1 Review of Datasets Employed in Machine Learning Algorithms for Detecting and Combating Security Threats and Attacks in the BYOD Environment
Learning models are generally based on past occurrences or experiences of an event or scenario, captured in a dataset, which is the key element used to train, test, and implement BYOD security models. Thus, the first step in implementing machine learning techniques for detecting and combating security threats and attacks in the BYOD environment is dataset gathering. The findings summarized in Table 4 show the different datasets used for this purpose. Analysis of the selected studies shows that the datasets can be broadly classified as homogeneous or heterogeneous. When a study uses one type of dataset, the dataset is termed homogeneous; when more than one type of dataset is used, it is termed heterogeneous. The datasets used in the reviewed studies are discussed below according to their nature.
a Homogeneous Datasets
In a homogeneous dataset, the authors employed only one type of dataset. For instance, Shah and Shankarappa (2018) utilized a homogeneous data source, the MDM event log; MDM here is a scheme implemented in a BYOD environment to control and monitor the role of smartphones, including their data operations. In a separate study, Chizoba et al. (2020) used homogeneous data generated from network traffic logs recorded as packets are transferred between networks; these logs were used to implement the machine learning approach for BYOD security threats and attacks. Muhammad et al. (2017) leveraged packet inter-arrival time (IAT) data obtained from the local network of the Georgia Institute of Technology. The packet IAT data of 27 mobile devices were collected over the UDP, TCP, and ICMP protocols; the dataset is homogeneous in nature, as it contains only the inter-arrival times of packets sent in a BYOD environment. In a related study, Muhammad et al. (2019) employed a test-bed dataset carefully gathered via mobile devices without interference. The dataset contains the inter-arrival times of packets from 27 mobile devices, such as tablets, laptops, and smartphones, used to evaluate device-type profiling; it is homogeneous in nature, as it contains only the inter-arrival time between two successive packets. In another study, Petrov and Znati (2018) utilized the MIT dataset, which is made up of 84 sets of phone event records such as call start time, incoming/outgoing direction, and call type (phone, data, or message). Eslahi et al. (2016), in their study on botnet detection, utilized a network traffic dataset generated from a mobile botnet; a data-filtering approach was employed so that the model gathered only the HTTP traffic records exchanged during client-server communication. In another study, conducted by Riasat et al. (2017), a publicly available Android malware dataset gathered from Contagio Mobile was utilized. The data contain 600 samples composed of two segments: crawled applications and the malicious applications of the Contagio library.
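Several of the studies above profile devices using packet inter-arrival times (IATs). Purely as an illustration of how such features might be derived (the timestamps and feature names below are invented, not taken from the studies), IATs can be computed from a capture's packet timestamps as follows.

```python
# Hypothetical sketch: deriving inter-arrival time (IAT) features from
# packet timestamps (in seconds); values are invented for illustration.
import numpy as np

timestamps = np.array([0.000, 0.013, 0.029, 0.031, 0.120, 0.124])
iats = np.diff(timestamps)   # time between consecutive packets

# Simple per-device summary features a classifier or clusterer could use.
features = {
    "iat_mean": iats.mean(),
    "iat_std": iats.std(),
    "iat_max": iats.max(),
}
print(features)
```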
b Heterogeneous Datasets
In a heterogeneous dataset, the data are obtained from various sources. For instance, the study conducted by Arora and Bhatia (2019) utilized several different datasets, such as FVC2006, ATVSFFpDB, the Spoofing-Attack Finger Vein Database, and the LivDet 2013 fingerprint datasets, to train the model, while datasets from LivDet 2015 were used to test it. These datasets are heterogeneous, as they consist of different fingerprint biometrics originating from different sources. In another study, Yerima et al. (2013) used 2000 samples of malware and benign data, of which 1000 were malware and the other 1000 benign. The authors asserted that, with 20 features chosen, there is higher variability in the malware samples than in the benign samples. In a related study, Chen et al. (2016) employed two distinct dataset sources (benign and malware) amounting to 7,970 samples, of which 4,350 were benign and 3,620 malware; the study does not analyse which dataset achieved the better result. Similarly, Lashkari et al. (2017), in their proposed framework for Android malware characterization and detection, utilized both benign and malware datasets to train the machine learning classification model. The authors collected 1527 benign apps from the Google Play market between 2015 and 2016, selected according to their popularity in each category present in the market; however, 27 of the apps were eliminated before the modeling phase because they were classified as suspicious by two different anti-virus products. On the other hand, 400 malware apps were collected across two classes (adware, containing 250 apps, and general malware, consisting of 150 apps). The adware category consists of different families, including Airpush, Dowgin, Kemoge, Mobidash, and Shuanet. Finally, the authors utilized Droidkin, a lightweight Android app similarity detector, to find the relationships within each category of the app dataset (general malware, adware, and benign).
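As a sketch of how a heterogeneous, two-source dataset might be assembled into a single labelled set before training, the snippet below merges hypothetical benign and malware samples; the column names and values are invented.

```python
# Hypothetical sketch: merging benign and malware samples from two sources.
import pandas as pd

benign = pd.DataFrame({"perm_count": [3, 5, 2], "api_calls": [40, 55, 21]})
malware = pd.DataFrame({"perm_count": [12, 9], "api_calls": [180, 140]})

benign["label"] = 0     # benign class
malware["label"] = 1    # malware class

# One labelled dataset combining both heterogeneous sources.
dataset = pd.concat([benign, malware], ignore_index=True)
print(dataset)
```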
3.1.2 Review of Machine Learning Algorithms for Detecting and Combating Security Threats and Attacks in the BYOD Environment
Based on the research findings, different machine learning approaches have been employed in the implementation of BYOD security. A comprehensive summary of the findings is presented in Table 4, illustrating the different algorithms used by different researchers for detecting security threats and attacks in a BYOD environment. It is observed that certain authors use several algorithms in order to determine which performs best. For instance, Shah and Shankarappa (2018) employed multiple algorithms, including SVM, MLP, BN, and RF, of which the SVM algorithm outperformed the other three, returning the fewest false positives and the highest accuracy; based on their findings, SVM therefore stands out for BYOD security threat and attack detection. Chizoba et al. (2020) utilized SVM, DT, RF, and ensemble algorithms, with ensemble learning used to combine the outputs of the three individual algorithms; the RF algorithm received the strongest support in the ensemble's majority vote. Similarly, Naive Bayes, RF, and SVM algorithms were adopted by Sokolova et al. (2017) for anomaly detection in BYOD environments; the authors reported only the results achieved with the NB model because it performed far better than the other schemes. Muhammad et al. (2017) modelled an intelligent filtering approach for BYOD security, using k-means to isolate incidents and uncover distinct clusters of normal and abnormal behaviours in a BYOD environment. In a related study, Muhammad et al. (2019) leveraged the Clustering-based Multivariate Gaussian Outlier Score (CMGOS) to identify irregular device behaviours. CMGOS combines clustering and density-estimation schemes: the k-means algorithm was employed for clustering, while a multivariate Gaussian model was used for density estimation. Because the k-means scheme alone recorded some inconsistencies in its results, k-means was used to establish the cluster boundaries (centroids 1 and 2) and its outcome served as input to the density estimation used to implement the model. To control unauthorized access in the BYOD environment, Petrov and Znati (2018) relied on artificial neural network and decision tree algorithms to detect any unauthorized attempt by adversaries to reach sensitive information; in addition, the model further obstructs their access in order to secure the data. Eslahi et al. (2016) leveraged the J48 form of decision tree (DT) to categorize the data and thereby analyse network behaviour; the J48 DT can proficiently detect recurring events in a mobile HTTP botnet. Yerima et al. (2013) employed the Naïve Bayes classifier to identify malware on Android devices, noting that the Bayesian model can combine expert and learned knowledge better than other learning algorithms. In another study, Chen et al. (2016) used multiple learning algorithms, including SVM, DT, ANN, NB, K-NN, and the Bagging predictor, to detect malware in an Android environment; the performance results indicate that the KNN algorithm outperformed the other learning algorithms. Riasat et al. (2017) adopted SVM and random forest learning models to detect the behaviour of Android malware; the authors showed experimentally that the RF algorithm produced better results than the SVM algorithm over the same processing time interval.
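To illustrate the majority-voting idea reported by Chizoba et al. (2020), the sketch below combines SVM, decision tree, and random forest classifiers with scikit-learn's VotingClassifier; the synthetic data are stand-ins, not the authors' dataset.

```python
# Hypothetical sketch: a hard (majority-vote) ensemble of three classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=3)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("dt", DecisionTreeClassifier(random_state=3)),
        ("rf", RandomForestClassifier(random_state=3)),
    ],
    voting="hard",   # each model casts one vote; the majority class wins
)
ensemble.fit(X, y)
print("Training accuracy:", ensemble.score(X, y))
```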
The k-nearest neighbors algorithm was employed in the study by Gangwal and Conti (2019) for time-series classification aimed at detecting covert cryptocurrency mining in mobile environments. The model operates with or without access rights to the suspicious device.
3.1.3 Review of Evaluation Metrics Employed to Assess the Performance of Machine Learning Algorithms for Detecting and Combating Security Threats and Attacks in the BYOD Environment
Performance measures are the metrics used to evaluate the performance of machine learning classification on BYOD security threats and attacks. The reviewed authors employed several evaluation metrics to ascertain the performance of BYOD models, such as accuracy, precision, recall, and F-score. These metrics can be calculated from the numbers of false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN), which constitute the components of the confusion matrix. The choice of evaluation metric depends on the researcher's aim and expertise. In this regard, Yerima et al. (2013) utilized several metrics, including the false negative, true positive, false positive, and true negative rates, as well as precision, accuracy, and error rate. These metrics were used to measure the performance of the model across different folds, and the authors maintained that 15 to 20 features can provide good performance. In another study, Eslahi et al. (2016) used accuracy, detection rate, and false alarm rate to assess the performance of the model, obtaining 98.60, 96.35, and 1.25 percent respectively. In a separate study, Chen et al. (2016) leveraged true positives, false positives, ROC, precision, recall, and accuracy to assess how well their model can detect malware in an Android environment. Aneja et al. (2018) utilized accuracy to assess the performance of their model, which showed an overall accuracy of 86.7 percent. The study by Daniel et al. (2018) utilized recall, precision, and accuracy for evaluating model performance; the model returned a reliable accuracy/precision result of over 99 percent at each run. Shah and Shankarappa (2018) used TP, TN, FP, FN, and accuracy to assess the performance of the BYOD security model they developed. In a separate study, Sokolova et al. (2017) relied on true positives, false positives, false negatives, and true negatives to assess the performance of their BYOD security model. Muhammad et al. (2019) leveraged outlier score accuracy to ascertain the performance of their BYOD scheme; the results show that, for 9, 100, and 324 IAT points, outlier score accuracies of 99.3% and 0.7% were achieved for normal and abnormal profiling respectively. In the same year, Arora and Bhatia (2019) employed performance metrics such as false acceptance rate, false rejection rate, accuracy, and average classification error, with a corresponding performance result reported for each metric and dataset. Similarly, standard evaluation metrics such as accuracy, precision, recall, and F1-score were adopted in the study conducted by Gangwal and Conti (2019) to assess the performance of a BYOD model; the precision and F-measure metrics yielded averages of 88 and 87 percent respectively. Finally, Chizoba et al. (2020), in their study on identifying advanced persistent threats using ensemble classifiers, employed several evaluation metrics, including true positive, false positive, precision, and recall, as well as F1-score, MCC, ROC, and PRC, using all of these to carefully assess the performance of the developed BYOD security scheme.
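For reference, the metrics named above follow directly from the confusion-matrix counts; the sketch below computes them from illustrative (invented) counts.

```python
# Illustrative sketch: metrics derived from confusion-matrix counts.
TP, TN, FP, FN = 90, 85, 10, 15   # invented counts for illustration

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)           # also called detection rate / TPR
f1 = 2 * precision * recall / (precision + recall)

print(f"ACC={accuracy:.3f} PRE={precision:.3f} REC={recall:.3f} F1={f1:.3f}")
```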
The reviewed studies show that most of the related work employed accuracy, recall, precision, and F-measure to evaluate the performance of the machine learning model. However, these metrics alone may not be sufficient when the dataset is imbalanced; in such instances, AUC is the more appropriate metric for evaluating the model.
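A minimal sketch of computing AUC with scikit-learn on a deliberately imbalanced toy sample (labels and scores invented) is shown below; unlike accuracy, AUC reflects how well the model's scores rank the rare positive class.

```python
# Illustrative sketch: AUC on an imbalanced sample.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]      # imbalanced: only two positives
y_score = [0.1, 0.2, 0.15, 0.3, 0.2, 0.1,    # model scores for each record
           0.4, 0.35, 0.8, 0.6]

print("AUC:", roc_auc_score(y_true, y_score))
```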
Table 4
A summary of the reviewed machine learning implementations for BYOD security threats and attacks
S/N | Author | Year | Datasets | ML Approach | Threats/Attacks Detected | Performance Measure | Area of Attack | Database
1 | (Shah & Shankarappa, 2018) | 2018 | MDM event log | SVM, MLP, BN, RF | Privacy breaches, data leakage | ACC | Device | IEEE
2 | (Chizoba et al., 2020) | 2020 | Network traffic | SVM, DT, MV ensemble | Persistent threat | PRE, REC, F-M, AUC | Network | Google Scholar
3 | (Sokolova et al., 2017) | 2016 | 9512 applications and a set of malware | RF, SVM, NB | Malware | REC, F-M, AUC, ACC | Network | Science Direct
4 | (Muhammad et al., 2019) | 2019 | Test-bed data containing 94 files of packet IATs | K-means clustering | Unauthorized access | ACC | Network | ACM
5 | (Muhammad et al., 2017) | 2017 | IAT packet data | K-means clustering | Unauthorized access, data leakage, data theft | ACC | Network | ACM
6 | (Mora et al., 2014) | 2014 | URL session dataset | RF, J48, PART/NNge, reduced-error pruning tree | Unauthorized access | ACC | Network | Google Scholar
7 | (Ho, 2014) | 2014 | Smartphone sensor data | Manhattan distance classifier, RF, Gaussian discriminant analysis (GDA), SVM | Data theft | FRR | Device | Google Scholar
8 | (Shabtai et al., 2012) | 2012 | Event log | K-means, RR, DT, BN, NB | Malware | TPR, FPR, ACC, AUC | Device | Springer
9 | (Kumar et al., 2020) | 2020 | Social network IoT nodes | DNN | Untrusted network | ACC | Device | Springer
10 | (Arora & Bhatia, 2019) | 2019 | Fingerprint benchmarks | DCNN | Spoofing attack | FAR, FRR, ACE, ACC | Apps | Springer
11 | (Samarathunge et al., 2018) | 2018 | Email dataset | KNN | Malware | ACC | Email application | IEEE
12 | (Petrov & Znati, 2018) | 2018 | Phone records of subjects | ANN, DT | Malware | ACC, PRE, REC | Data | IEEE
13 | (Eslahi et al., 2016) | 2016 | Mobile botnet dataset | DT | Botnet | ACC, false alarm rate | Device | IEEE
14 | (Gangwal & Conti, 2019) | 2019 | Profiled cryptocurrency-mining magnetic field samples | KNN | Unauthorized access | ACC, REC, PRE, F-M | Device | IEEE
15 | (Aneja et al., 2018) | 2018 | Device fingerprint packets | CNN | Spoofing attack | ACC | Device | IEEE
16 | (Joshi et al., 2016) | 2016 | NSL-KDD dataset | SVM | DoS attack, intrusion attack | ACC | Device | IEEE
17 | (Yerima et al., 2013) | 2013 | APKs comprising benign apps and malware samples | BN | Malware | ERR, ACC, TNR, FPR, TPR, FNR, PRE, AUC | Device | IEEE
18 | (Chen et al., 2016) | 2016 | Malicious APK files | SVM, C4.5, MLP, NB, K-NN (IBk), Bagging | Malware | ACC | Apps | ACM
19 | (Riasat et al., 2017) | 2017 | APK files, public Android malware datasets | SVM, RF | Malware | ACC | Apps | Google Scholar
20 | (Sahs & Khan, 2012) | 2012 | Benign and malicious Android applications | SVM | Malware | ACC, PRE, REC, F-M | Apps | IEEE
21 | (Akhuseyinoglu & Akhuseyinoglu, 2016) | 2016 | Traffic data log | NB | Malware | ACC, kappa statistic | Mobile devices | IEEE
22 | (Tan et al., 2020) | 2020 | Network traffic data logs | MLPs | Malware | ACC | Network | IEEE
23 | (Kyriazis, 2018) | 2018 | Apache Spark dataset | K-means clustering | Malware | NIL | Cloud environments | IEEE
24 | (Tout et al., 2019) | 2019 | Real-time generated dataset | LR, SVR, NN, DNN | Device overheads | RMSE | Device resources | Science Direct
25 | (San Miguel et al., 2018) | 2018 | Drebin and AndroZoo repositories | DT, SVM, KNN, NB | Malware | ACC, PRE, REC, F-M | Network | ACM
26 | (Temper et al., 2015) | 2017 | Biometric sample dataset | Fuzzy-rough nearest neighbor | User privacy breaches | Equal error rate (EER) | Mobile devices | IEEE
27 | (Wang et al., 2017) | 2018 | Network traffic data | SVM | Malware | F-M | Devices | IEEE
28 | (Chukka, 2020) | 2020 | APK files | MNB, RF, SVM | Malware | PRE, REC, F-M | Device, application | Google Scholar
29 | (Kotak & Elovici, 2019) | 2021 | Network traffic data | Neural network | Unauthorized access | ACC, PRE, REC, F-M | Network | Springer
30 | (Bai et al., 2021) | 2021 | Network traffic data | CNN | Malware | ACC, PRE, REC, F-M | Mobile network devices | Google Scholar
31 | (Narayanan et al., 2018) | 2018 | Malware, benign, and wild datasets | SVM, SMO | Malware | ACC, PRE, REC, F-M | Network | Springer
32 | (Saracino et al., 2016) | 2016 | Genome, Contagio Mobile, and VirusShare datasets | LDC, K-NN, MLP, PARZC, RBF | Malware | NIL | Mobile network devices | IEEE
33 | (Li et al., 2018) | 2018 | Benign dataset | SVM, PART, RF | Malware | ACC, PRE, REC, F-M | Network, devices | IEEE
34 | (Narayanan et al., 2017) | 2017 | Malware and benign datasets | SVM, RF | Malware | ACC, PRE, REC, F-M | Network, devices | IEEE
35 | (Zhu et al., 2017) | 2017 | Benign, malware, and VirusShare datasets | CNN, RBM, DBN, RNN, Bayesian, SVM, MLP | Unauthorized access | PRE, REC, F-M | Networks | IEEE
36 | (Pajouh et al., 2018) | 2017 | Malware and benign samples | C5.0, RF | Malware | PRE, REC | Network devices | Springer
37 | (Malhotra & Bajaj, 2016) | 2016 | Malware sample data | ANN, K-means | Malware | ACC, PRE, REC | Devices | Springer
38 | (Das et al., 2015) | 2016 | Malware samples | J48, NB, LR, SVM, SMO, JRip, MLP | Malware | AUC | Network | IEEE
39 | (Lashkari et al., 2017) | 2017 | 1,527 benign apps and 400 malware apps | KNN, RF, DT, random tree, LR | Malware | ACC, PRE, FPR | Application | IEEE
40 | (Anwar et al., 2016) | 2016 | UNB ISCX public datasets | SVM, KNN, J48, Bagging, NB, RF | Botnet | TPR, FPR, ACC | Application | IEEE