Intelligent IoT security monitoring based on fuzzy optimum-path forest classifier

Detection of intrusions in Internet of Things (IoT) networks is essential to maintain the availability and integrity of the data generated and transmitted by connected devices. Such a procedure is paramount when the data originate from critical activities, such as those of the military, financial, industrial, and health sectors. In the last decades, machine learning (ML)-based approaches have become one of the most suitable and widely adopted procedures for the task, providing automatic, fast, and accurate results. Despite such success, the literature still presents a gap regarding valid applications of intrusion detection in IoT environments, which is usually a challenging task involving different types of attacks. In this context, this work applies a recent technique based on graphs and fuzzy logic, namely Fuzzy Optimum-Path Forest (Fuzzy OPF), to detect threats that deviate from an IoT network's regular traffic. We evaluate our model against five well-known ML algorithms, i.e., Linear Discriminant Analysis, Support Vector Machine, Naive Bayes, K-Nearest Neighbors, and the standard Optimum-Path Forest. Experimental results show that Fuzzy OPF outperforms the baselines considering accuracy, recall, and F1 metrics. As a result, the Fuzzy OPF proposal for intrusion detection achieved hit rates of 98% and 99%.


Introduction
Internet of Things (IoT) refers to a vast number of connected electronic devices capable of transmitting and collecting data over the internet (Almohri et al. 2020; Sarkar et al. 2015). The concept is applied to monitor environment-related events and industrial activities, collect information associated with human behavior, and provide information for military operations (Butun et al. 2019). Despite such advantages, the data generated in the IoT environment are subject to integrity risks since they can be easily manipulated and destroyed by malicious attacks (Lv et al. 2020), described as (i) physical attacks, such as node tampering and attack, (ii) software attacks, e.g., code injection and data privacy issues, and (iii) network attacks, e.g., Sybil attack and blackhole.
In the last decades, many researchers employed machine learning (ML)-based approaches to tackle similar drawbacks related to malicious attack detection considering several fields, such as image recognition (Moreira et al. 2022; Yao et al. 2020), deep fake detection, and remote sensing (Huang et al. 2015; Santos et al. 2021), to cite a few. These works aim at finding solutions to prevent, detect, or mitigate attacks on complex networks. Moreover, several works available in the literature address the problem of intrusion detection through ML algorithms (Nugroho et al. 2020; da Costa et al. 2019; Al-Garadi et al. 2020; Saranya et al. 2020; Liu et al. 2020; Shaver et al. 2020), exposing the vulnerability of IoT networks and their limitations, as well as the practical issues related to traditional security methods, such as encryption, firewalls, or Intrusion Detection Systems (IDSs).
Among a wide variety of machine learning methods, a graph-based framework called Optimum-Path Forest (OPF) (Papa et al. 2009) obtained considerable popularity in the last decade due to its success in a wide variety of applications, ranging from medicine (Ribeiro et al. 2015) to engineering (Passos et al. 2016), to cite a few. Moreover, the framework is flexible enough to be adapted for different tasks, such as anomaly detection (Passos et al. 2016; Guimaraes et al. 2019) and data imbalance (Passos et al. 2020). Recently, Souza et al. (2019) and de Souza et al. (2021) proposed an improved variant that combines the OPF classifier with fuzzy logic, namely Fuzzy OPF, obtaining satisfactory results over a variety of applications. However, as far as we know, Fuzzy OPF has never been employed to tackle the problem of intrusion detection in IoT networks. Therefore, the main contributions of this paper are as follows:
- to evaluate the behavior of Fuzzy OPF for the task of detecting intrusions in IoT environments;
- to foster the literature in the context of graph-based algorithms, fuzzy applications, and intrusion detection in IoT networks.
The remainder of this paper is organized as follows. Section 2 provides a detailed review of related work, while Sect. 3 describes the Fuzzy OPF algorithm. Further, Sect. 4 describes the methodology employed in this work, which comprises the datasets and experimental configurations. Finally, Sects. 5 and 6 state the experimental results and conclusions, respectively.

Related work
Despite the relevance of the field, the literature presents a relatively small number of datasets and research works focused on security in IoT networks. Cheema et al. (2020) introduced an intrusion detection system based on distributed machine learning using Blockchain and Support Vector Machine (SVM), while Chkirbene et al. (2020) combined Random Forest with Classification and Regression Trees to classify different types of attacks. Further, Ghazi and Moulay Rachid (2020) proposed a cloud system for real-time intrusion detection and monitoring of communication and attacks before they spread across the network, while Alalade (2020) used Extreme Learning Machine and Artificial Immune System (AIS-ELM) to build an IDS to detect network anomalies. Maniriho et al. (2020) implemented an anomaly-based approach using a resource selection mechanism. The proposal uses the Random Forest algorithm to classify traffic as normal or anomalous over the IoTID20 dataset, achieving an accuracy of 99.9% in detecting DoS attacks.
Similar works (Ghosh et al. 2019; Vikram and Mohana 2020; Guimaraes et al. 2019) highlight the problem of outliers and the importance of good quality data for the task, especially considering IDS constructions, as anomalies can considerably degrade the performance, affecting the final decision. Following the same line, Arshad et al. (2020) proposed an intrusion detection framework for the energy-constrained IoT devices that form the basis of an Industrial IoT (IIoT) ecosystem, while Hassan et al. (2021) proposed a cooperative data generator based on a trained downsampler encoder using ML and Deep Learning (DL) techniques to ensure better performance in the IIoT environment. Finally, Magaia et al. (2021) used deep reinforcement learning for IIoT in smart cities, in addition to recurrent neural networks and convolutional neural networks.
Regarding DL algorithms, Swarna Sugi and Ratna (2020) presented an IDS model based on DL and ML to overcome security attacks in IoT networks. The authors propose a model that combines k-Nearest Neighbor
In the context of fuzzy-based approaches, Cristiani et al. (2020) proposed the Fuzzy Intrusion Detection System for IoT Networks (FROST), which uses the basis of fuzzy theory to make learning models more flexible and improve the performance in the classification of inaccurate data. Naik et al. (2017) built a dynamic fuzzy rule interpolation (D-FRI) approach to enhance the fuzzy rule interpolation (FRI) model that works with static rules. D-FRI was employed to support network security analysis in constructing an intelligent intrusion detection system (IDS). Manimurugan et al. (2020) presented an algorithm based on the combination of Crow Search Optimization (CSO) and Adaptive Neuro-Fuzzy Inference System (ANFIS) techniques.

Fuzzy optimum-path forest
The Fuzzy Optimum-Path Forest (Souza et al. 2019) is an OPF variant that improves sample selection and classification performance through fuzzy logic. Such an approach also alleviates some problems related to noise, class imbalance, and outliers. The main idea is to calculate a degree of participation or a fuzzy association with a particular class for each sample in the training set. In a nutshell, a clustering step is performed using the unsupervised variant of the OPF framework (Rocha et al. 2009). This step also computes the density of the samples regarding their respective clusters. Further, this density is considered to attribute the node a membership value, which is incorporated in the cost function of the Fuzzy OPF classifier.
Suppose a graph $G = (N, A)$, where $N$ is the set of training nodes and $A$ is the set of edges connecting each pair of training samples. The density $\rho(q)$ of a given sample $q$ is computed using the probability density function (PDF) of the unsupervised OPF (Rocha et al. 2009), as follows:

$$\rho(q) = \frac{1}{\sqrt{2\pi\psi^{2}}\,|A_k(q)|} \sum_{u \in A_k(q)} \exp\left(\frac{-d^{2}(q,u)}{2\psi^{2}}\right), \qquad \psi = \frac{d_f}{3}, \tag{1}$$

where $A_k(q)$ stands for the k-neighborhood of sample $q$, $d(q,u)$ is the Euclidean distance between the nodes $q$ and $u$, and $d_f$ is the largest edge length in the graph $(N, A_k)$.
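To make the density step concrete, the following is a minimal pure-Python sketch of a Gaussian k-neighborhood density in the style of the unsupervised OPF (Rocha et al. 2009). The function name `knn_density`, the brute-force neighbor search, and the choice $\psi = d_f / 3$ are assumptions made for illustration; this is not the LibOPF implementation.

```python
import math

def knn_density(points, k):
    """Gaussian k-neighborhood density for every sample (illustrative sketch)."""
    n = len(points)
    # Pairwise Euclidean distances d(q, u).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbors of each sample, excluding the sample itself.
    neigh = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
             for i in range(n)]
    # d_f: largest edge length in the k-NN graph; psi = d_f / 3 is assumed here.
    d_f = max(dist[i][j] for i in range(n) for j in neigh[i])
    psi = d_f / 3.0
    norm = math.sqrt(2.0 * math.pi * psi ** 2) * k
    return [sum(math.exp(-dist[i][j] ** 2 / (2.0 * psi ** 2)) for j in neigh[i]) / norm
            for i in range(n)]

# Four clustered samples and one isolated sample (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
rho = knn_density(points, k=2)
```

In this toy example the isolated sample receives the lowest density, which is the behavior the membership function later relies on. The sketch assumes that not all samples coincide, since $d_f = 0$ would make $\psi$ zero.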
The fuzzy membership $F_\Theta(q) \in [0, 1]$, where $\Theta = \{\sigma, \rho_{\min}, \rho_{\max}\}$, assigns a real value to each sample $q$, thus defining the membership of the sample regarding its respective class. An adequate membership function should consider the following restrictions: (i) a lower limit hyperparameter $\sigma > 0$ and (ii) the ability to describe the behavior and properties of the samples (Lin and Wang 2002). This work computes the Fuzzy OPF membership function through Eq. (2), in which $\rho_{\min} \le \rho(q) \le \rho_{\max}$, and $\rho_{\min}$ and $\rho_{\max}$ are the lowest and highest densities, respectively. Briefly, the instances situated at the borders of the clusters present a lower density value and usually receive small membership values, i.e., they retain a lower "strength" in the conquest process. Such behavior penalizes samples located far from the clusters' centers, thus mitigating overfitting. Consequently, the most significant examples, i.e., the instances with higher membership values, become more relevant in the conquering process, providing the best path cost for the remaining samples. The process is conducted through the path cost function $f_{\max}$ of Eq. (3), where $T$ defines the set of prototypes, $\phi_q$ corresponds to a path of connected samples with root node in $T$ and final node $q$, and $d(q, u)$ is the distance, or the cost to connect samples $q$ and $u$. Further, $\phi_q \cdot \langle q, u \rangle$ denotes the concatenation between the path $\phi_q$ and the connection between nodes $q$ and $u$.
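As an illustration of restrictions (i) and (ii), the sketch below maps densities linearly onto the interval $[\sigma, 1]$, in the style of the linear membership of Lin and Wang (2002). The linear form and the name `fuzzy_membership` are assumptions; the exact expression used by Fuzzy OPF may differ.

```python
def fuzzy_membership(rho, sigma):
    """Map densities to memberships in [sigma, 1] (assumed linear form)."""
    rho_min, rho_max = min(rho), max(rho)
    if rho_max == rho_min:
        # All samples are equally dense, so all get full membership.
        return [1.0 for _ in rho]
    scale = (1.0 - sigma) / (rho_max - rho_min)
    # The lowest density maps to sigma and the highest density maps to 1.
    return [scale * (r - rho_min) + sigma for r in rho]

memberships = fuzzy_membership([0.1, 0.5, 0.9], sigma=0.2)
```

Under this assumed form, setting `sigma=1.0` gives every sample a membership of 1, so no sample is penalized. The sketch is only meaningful for $\sigma \in (0, 1]$; larger values would push memberships above 1.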
Consider $T^{*} \subseteq T$ as a set of prototypes that minimizes the errors of the training step. During training, the Fuzzy OPF assigns to every sample $u \in N$ an optimal cost $P(u)$, given by Eq. (4). The optimal cost is used to perform the conquering process in the training and testing stages. It is important to emphasize that low fuzzy membership values represent samples with little relevance to the training stage, whereas examples with high membership values are more representative.
Note that when $F_\Theta(x) \approx 0$ in Eq. (2), $P(x)$ assumes a value equal to 0 in Eq. (4), leading to an irrelevant capacity of conquering (Souza et al. 2019). To avoid such behavior, this work assumes $\sigma$ values within the range [0.2, 1.2]. The implementation of the proposed model is presented in Algorithm 1.
The algorithm receives as input a graph $G = (N, A)$, a set of prototypes $T \subseteq N$, a map of training set labels $\lambda$, and the lower bound parameter $\sigma$. The outputs are a predecessor map $O$, a path-cost map $P$, and a label map $C$. Five auxiliary variables are used: a priority queue $L$, a variable $cst$, a density map $\rho$, and the minimum and maximum densities $\rho_{\min}$ and $\rho_{\max}$, respectively.
Lines 1-4 compute the density of each sample, initialize the predecessor and cost maps, and calculate the fuzzy association value of every sample, while Line 5 computes $\rho_{\min}$ and $\rho_{\max}$, which are used in Eq. (2). Lines 6-7 initialize the prototype costs with zeros, set their true labels, and insert the prototypes in the priority queue.
The main loop is defined by lines 8-16, corresponding to the competition process. Line 9 removes from $L$ a sample $q$ such that $P(q)$ is minimum; since the prototypes have a zero cost, they are the first to be taken from the priority queue. The loop in lines 10-15 is repeated for each sample and calculates the best path cost of the training samples (line 12). If the node is conquered, it is removed from the priority queue (lines 13-14). Lastly, line 15 updates the predecessor map with the cost value of all samples, assigning each node to the label of the prototype that conquered it.
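The competition process described above can be sketched in pure Python as follows. This is a hedged illustration rather than Algorithm 1 itself: it uses the standard OPF path cost (the maximum edge length along the path, Papa et al. 2009), and the way the membership enters the cost (dividing the cost offered by sample `q` by `membership[q]`) is an assumption chosen so that low-membership samples conquer less. The name `opf_compete` and the complete-graph adjacency are also illustrative, and memberships are assumed to lie in (0, 1].

```python
import heapq
import math

def opf_compete(points, labels, prototypes, membership):
    """Prototype competition on a complete graph (illustrative sketch)."""
    n = len(points)
    cost = [math.inf] * n     # path-cost map P
    pred = [None] * n         # predecessor map O
    label = [None] * n        # label map C
    done = [False] * n
    heap = []                 # priority queue L
    for t in prototypes:      # prototypes start with zero cost and true label
        cost[t], label[t] = 0.0, labels[t]
        heapq.heappush(heap, (0.0, t))
    while heap:
        c, q = heapq.heappop(heap)   # sample q with minimum P(q)
        if done[q] or c > cost[q]:
            continue                 # stale queue entry, skip it
        done[q] = True
        for u in range(n):
            if done[u]:
                continue
            # ASSUMPTION: the offered cost is scaled by 1 / membership[q].
            cst = max(cost[q], math.dist(points[q], points[u])) / membership[q]
            if cst < cost[u]:        # q conquers u
                cost[u], pred[u], label[u] = cst, q, label[q]
                heapq.heappush(heap, (cst, u))
    return pred, cost, label

# Toy data: two well-separated classes, one prototype per class.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
pred, cost, label = opf_compete(points, [0, 0, 1, 1], [0, 2], [1.0, 1.0, 1.0, 1.0])
```

With all memberships equal to 1, the sketch reduces to the standard OPF competition. Because memberships are at most 1, the offered cost never drops below the cost of the conquering sample, so the queue-based processing order remains valid.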

Proposed approach
The approach employed in this work considers a network topology where all devices are connected to a concentrator device (switch), which conducts all data traffic. Further, an IDS procedure acts as a Monitoring Server, which runs the Fuzzy OPF classifier.
This server collects all network data traffic, and the Data Collector module generates a dataset to train the Fuzzy OPF. Samples from this dataset are labeled as intrusions or non-intrusions. Later, these resources are used to generate alerts on intrusion detection. If there is an infected device on the network or an external threat (cyber attack), the system will generate alarms so that a person responsible for the system can decide how to proceed. Fig. 1 illustrates the intrusion

Methodology
This section presents the datasets, experimental setup, and metrics employed in this work.

Dataset description
The experiments provided in this paper are conducted over two well-known datasets for attack detection in IoT networks, both drawn from Austin Texas 2018 and referred to hereafter as Austin Texas-Scenario 1 and Austin Texas-Scenario 2. The datasets follow a specific pattern of fake data injection attacks in IoT networks, and their attributes are described in Table 1.
Regarding the datasets used in this article, we emphasize that they underwent a normalization process, where each column of the dataset was normalized to values between 0 and 1 using the following equation:

$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)},$$

where $z_i$ is the $i$th normalized value in the column, $x_i$ is the $i$th original value, and $\min(x)$ and $\max(x)$ are the minimum and maximum values in the column, respectively.
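The column-wise min-max normalization described above can be sketched as follows. The function name `min_max_normalize` and the handling of constant columns (mapped to 0) are assumptions added for illustration.

```python
def min_max_normalize(rows):
    """Scale every column of a list-of-rows table to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    # A constant column has max == min; it is mapped to 0.0 to avoid dividing by zero.
    return [[(x - l) / (h - l) if h > l else 0.0
             for x, l, h in zip(row, lo, hi)]
            for row in rows]

normalized = min_max_normalize([[1, 10], [2, 20], [3, 30]])
```

In practice, the minimum and maximum would normally be taken from the training portion only and reused for the validation and test portions; the text does not state which choice was made.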

Experimental setup
The experiments compare Fuzzy OPF with five traditional classifiers: Optimum-Path Forest (OPF), Support Vector Machine (SVM), Naive Bayes classifier (Bayes), k-Nearest Neighbors classifier (k-NN), and Linear Discriminant Analysis classifier (LDA). For a proper evaluation, the experiments comprise a cross-validation step, in which the datasets are randomly split into 70% for training, 15% for validation, and 15% for testing. The procedure is repeated over 20 folds to support the statistical analysis.
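A minimal sketch of the repeated random 70/15/15 partition is shown below. The function name `split_70_15_15`, the use of one seed per repetition, and the sample count of 1000 are illustrative assumptions, not details taken from the experiments.

```python
import random

def split_70_15_15(n_samples, seed):
    """Return shuffled index lists for the train, validation, and test partitions."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)       # reproducible shuffle per repetition
    n_train = round(0.70 * n_samples)
    n_val = round(0.15 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 20 independent repetitions, one seed per repetition (hypothetical sample count).
splits = [split_70_15_15(1000, seed) for seed in range(20)]
```

Each repetition yields disjoint index sets, so the per-repetition scores can be collected and compared statistically afterwards.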
Finally, the implementation of OPF and Fuzzy OPF was carried out using the LibOPF library. Further, the k-NN, Naive Bayes, SVM, and LDA classifiers were implemented using Scikit-learn (Pedregosa et al. 2011). The experiments were conducted on a machine with 6 GB of RAM and an Intel® Core™ i3-M380 CPU @ 2.53 GHz × 4, running the 64-bit Linux Ubuntu 20.04.2 operating system.

Statistical metrics
This article employs five metrics to assess the results obtained by the Fuzzy OPF on the task of intrusion detection in IoT networks, described as follows:
- Accuracy: the overall probability, i.e., the global average, that a prediction is correct.
- Recall: the ratio between the positive predictions made correctly and all samples that are actually positive.
- Intrusion: the probability that samples characterized as intrusions are, in fact, intrusion samples.
- Normal: the probability that samples characterized as normal are, in fact, normal samples.
- F1 Score: the harmonic mean between precision (Intrusion) and recall, thus appropriate for evaluating imbalanced datasets.
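The five metrics can be computed from the confusion-matrix counts, as in the sketch below. The mapping of "Intrusion" and "Normal" to the per-class precisions is one reading of the definitions above, and the function name `binary_metrics` and the toy labels are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, recall, per-class precision, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    intrusion = tp / (tp + fp) if tp + fp else 0.0   # precision of the intrusion class
    normal = tn / (tn + fn) if tn + fn else 0.0      # precision of the normal class
    f1 = (2 * intrusion * recall / (intrusion + recall)
          if intrusion + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "recall": recall,
            "intrusion": intrusion, "normal": normal, "f1": f1}

# Toy labels: 1 = intrusion, 0 = normal.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

In this toy example there are three true positives, three true negatives, one false positive, and one false negative, so all five metrics equal 0.75.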

Experimental results
This section presents the experimental results and statistical analysis, as well as a brief discussion regarding such values. Further, it also provides the procedures adopted for Fuzzy OPF hyperparameter fine-tuning. First, the Fuzzy OPF performance is compared against the SVM, k-NN, Naive Bayes, and LDA techniques (Fig. 2).
Due to the proximity of the Fuzzy OPF and OPF results, Tukey's statistical test was used to verify whether the classifiers differ significantly in accuracy. Figure 3a shows no significant difference between Fuzzy OPF and OPF in Scenario 1, even though both present superior results concerning k-NN, Naive Bayes, SVM, and LDA. Regarding Scenario 2, Fig. 3b shows no significant difference between Fuzzy OPF, OPF, and k-NN, which are superior to Bayes, LDA, and SVM.
Further, we investigate the model's discrimination rate between the intrusion and no intrusion classes. In this context, Fig. 4 shows the Fuzzy OPF and OPF confusion matrices for the two evaluated scenarios. The matrices show true positive cases above 98.16% for intrusion packets and 87.73% accuracy for no intrusion packets in Scenario 1. Regarding Scenario 2, the true positive cases are above 99.24% for intrusion packets and 93.61% for no intrusion packets (true negative cases). Table 2 presents the computational load (in seconds) of each technique. As expected, Fuzzy OPF demanded more computational resources for training, due to the unsupervised OPF step employed to compute the membership function, whose implementation considers several repetitions to find the best graph cut. On the other hand, when considering the test step, Fuzzy OPF has a satisfactory result, performing even faster than the standard OPF, SVM, and k-NN over the Austin Texas-Scenario 1 dataset. Regarding the Austin Texas-Scenario 2 dataset, Fuzzy OPF obtained a time statistically equal to the standard OPF and a faster time than the other classifiers. Such efficiency in the testing stage suggests that Fuzzy OPF is suitable for IoT network intrusion detection on low-power embedded computational devices.
This feature makes Fuzzy OPF suitable to be an intrusion detector for IoT networks, as it can achieve good results with limited computational resources.

Ablation
This section presents the Fuzzy OPF hyperparameter optimization. In this context, Fig. 5 depicts a grid-search procedure, where the possible arrangements of $\sigma$ and $k_{\max}$ are considered to provide the best results on the validation sets over the Austin Texas-Scenario 1 (a) and Austin Texas-Scenario 2 (b) datasets.
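A grid search of this kind can be sketched as an exhaustive loop over the candidate pairs. The name `grid_search`, the candidate values, and the toy scoring function below are hypothetical placeholders; in a real run, `evaluate` would train Fuzzy OPF with the given pair and return its validation accuracy.

```python
from itertools import product

def grid_search(evaluate, sigmas, k_values):
    """Return (best_score, sigma, k_max) over all candidate pairs."""
    best = None
    for sigma, k_max in product(sigmas, k_values):
        score = evaluate(sigma, k_max)   # e.g., accuracy on the validation set
        if best is None or score > best[0]:
            best = (score, sigma, k_max)
    return best

# Toy scoring function that peaks at sigma = 0.8 and k_max = 100 (illustrative only).
def toy_score(sigma, k_max):
    return -(sigma - 0.8) ** 2 - ((k_max - 100) / 100) ** 2

best = grid_search(toy_score, [0.2, 0.4, 0.6, 0.8, 1.0, 1.2], [50, 100, 150])
```

Selecting the pair on the validation set, rather than on the test set, keeps the reported test results independent of the tuning procedure.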
Regarding the Austin Texas-Scenario 1 dataset, Fig. 5a shows that the most accurate results were obtained in the intervals $\sigma \in [0.6, 0.8]$ and $k_{\max} \in [100, 150]$. Concerning the Austin Texas-Scenario 2 dataset in Fig. 5b, one can see that Fuzzy OPF obtained the best results considering $\sigma \in [0.6, 0.8]$ and $k_{\max} \in [50, 150]$. Such behavior leads to the following conclusions:
1. When $\sigma$ is equal to 1.0, the Fuzzy OPF converges to the standard OPF, i.e., in the worst case, it gets results as good as the OPF classifier.
2. Fuzzy OPF can find better alternatives for more complex scenarios, such as in cases with a reduced number of resources or anomalous instances.

Conclusion
This paper introduces the Fuzzy Optimum-Path Forest to the task of intrusion detection in IoT networks. The approach considers a supervised binary classification task (intrusion and no intrusion packets) over two public IoT network traffic datasets, i.e., Austin Texas-Scenario 1 and Austin Texas-Scenario 2.
Experimental results evaluate the model's performance against five different ML algorithms: SVM, k-NN, LDA, Naive Bayes, and OPF, where both the Fuzzy OPF and the standard OPF obtained the best results overall considering the task of intrusion detection. In this context, Fuzzy OPF reached 99.24% of true positive cases with the best time performance during the testing step, demonstrating the model's suitability for energy-constrained embedded devices for IoT network threats and abnormality detection.
Regarding future work, we aim to extend the model to classify the type of threat instead of performing a binary classification between intrusion and no intrusion packets.

Data availability
Enquiries about data availability should be directed to the authors.

Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.