Our approach proceeds in several steps, which we describe in turn. We propose a deep learning model for intrusion detection in IoT networks based on ensemble learning, combining CNN and DNN models trained on the Edge-IIoTset dataset. We also compare the resulting CNN and DNN models with one another and examine how to handle an imbalanced dataset. First, we preprocess the dataset. Second, we build several CNN and DNN models with different architectures, focusing on how they differ (activation functions and number of layers), and compare them; at this stage the models are trained without bootstrapping so that the comparison is made under the same conditions. Third, we apply the class weights technique, assigning each class a weight so that minority classes are easier to detect during training, and compare the results obtained. In the final phase, with weights assigned to each class, we aim to improve accuracy by proposing an ensemble learning model. It uses Bagging: models are trained on bootstrapped samples, and their predictions are aggregated by majority voting to obtain the final prediction. The approach thus integrates preprocessing, individual model analysis, and ensemble strategies with the aim of improving intrusion detection performance in IoT networks. [6]
A. Dataset
In our research, we used the Edge-IIoTset, a cybersecurity dataset designed for machine-learning-based intrusion detection in IoT and IIoT systems. It was generated using a purpose-built IoT/IIoT testbed that includes a large, representative set of devices, sensors, protocols, and cloud-edge configurations. The Edge-IIoTset provides 61 features that can be used for training and testing intrusion detection models. It was created specifically for industrial IoT systems and includes network traffic collected from a diverse set of IoT devices and digital sensors. The dataset is intended for detecting various types of attacks, including denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks, information gathering attacks, man-in-the-middle (MitM) attacks, injection attacks, and malware attacks []. By using this dataset, we aim to improve the accuracy and effectiveness of intrusion detection systems in industrial IoT environments. The dataset itself is quite large, with dimensions of 157800×63, or approximately 10 million elements.
Because the Edge-IIoTset dataset is large-scale, it supports comprehensive analysis and the development of scalable solutions that can handle the volume of data generated in IIoT deployments.
The dataset contains 15 classes, but they are unevenly represented. For example, the normal class is the largest, accounting for 71.43% of the dataset, while some classes, such as “Fingerprinting”, do not exceed 0.03%. Such classes can become effectively invisible to the model or be detected as another attack. This motivates us to look for methods that make the model take these classes into account and detect them correctly, and to measure their effect on the accuracy of our model.
B. Preprocessing
In the preprocessing phase, the raw data undergoes several transformations to make it suitable for deep learning algorithms: feature selection, encoding of categorical features, and feature scaling. In the initial phase of data preparation, we transformed the columns "http.request.method", "http.referer", "http.request.version", "dns.qry.name.len", "mqtt.conack.flags", "mqtt.protoname", and "mqtt.topic". To do so, we used the encode_text_dummy technique, which is designed for qualitative or categorical data that statistical algorithms cannot interpret directly. It converts such data into binary variables taking the value 0 or 1, which indicate the absence or presence of a particular category within the respective variable.
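The dummy encoding described above can be sketched as follows. The body of encode_text_dummy is not given in this paper, so the version below is an assumed implementation built on pandas' `get_dummies`, and the toy values are invented for illustration only.

```python
import pandas as pd


def encode_text_dummy(df, name):
    """Replace column `name` with one 0/1 indicator column per category.

    Assumed implementation: the paper names the helper but does not list it.
    """
    dummies = pd.get_dummies(df[name])
    for category in dummies.columns:
        # One new column per category, e.g. "http.request.method-GET"
        df[f"{name}-{category}"] = dummies[category].astype(int)
    df.drop(name, axis=1, inplace=True)


# Toy frame with made-up values, not rows from Edge-IIoTset
toy_df = pd.DataFrame({
    "http.request.method": ["GET", "POST", "GET", "0"],
    "tcp.len": [120, 64, 0, 32],
})
encode_text_dummy(toy_df, "http.request.method")
print(toy_df)
```

Each category becomes its own column, so a column with many distinct values widens the feature matrix considerably.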
1) Feature selection
Feature selection is a crucial step in the data preprocessing phase, aiming to enhance the efficiency and performance of machine learning models by focusing on the most relevant and informative features.
• Drop Function:
- The features selected for removal were dropped using the `drop` function from the pandas library, a method that removes specified columns or rows from a DataFrame.
• Streamlining the Dataset:
- Feature selection streamlines the dataset so that it contains only the most pertinent information for model training. This helps reduce dimensionality, mitigate the risk of overfitting, and improve the model's generalization to new, unseen data.
• Removing Irrelevant Information:
- Excluding certain features eliminates irrelevant or less informative data that may not contribute significantly to the model's ability to make accurate predictions, resulting in a more focused and efficient dataset.
2) Splitting Data into Features and Labels
In this stage, we split the data into features and labels, a crucial preparatory step. The statements “X = df.drop([label_col], axis=1)” and “y = df[label_col]” achieve this separation, with X representing the input features and y the output labels.
This separation is essential for supervised learning, where the model is trained to capture the relationship between the input features and the corresponding output labels.
It reflects the foundational principle of supervised learning: the model learns from historical data in which the input features (X) are associated with known output labels (y), which allows it to pick up the underlying patterns and generalize its predictions to new, unseen examples.
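A minimal, runnable illustration of this separation is given below. The column names, the label column name "Attack_type", and the values are assumptions made for the example and are not taken from our pipeline.

```python
import pandas as pd

label_col = "Attack_type"  # assumed name of the label column

# Toy frame with made-up values
df = pd.DataFrame({
    "tcp.len": [120, 64, 0, 32],
    "udp.time_delta": [0.0, 0.1, 0.0, 0.3],
    "Attack_type": ["Normal", "XSS", "Normal", "Fingerprinting"],
})

X = df.drop([label_col], axis=1)  # every column except the label
y = df[label_col]                 # the label column only

print(X.shape, y.shape)
```

Because `drop` returns a new DataFrame by default, `df` itself is left unchanged.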
3) Train-Test Split
Splitting the dataset into training and test sets is a critical step in preparing it for machine learning. The dataset is divided into two subsets: one for training the model and another for evaluating its performance.
• For training models (no ensemble learning)
We use the train_test_split function from scikit-learn, which takes the input features (X) and the corresponding labels (y) as its arguments, together with the following parameters.
test_size = 0.2
This parameter specifies the proportion of the dataset that should be reserved for testing. In this case, 20% of the data is set aside for evaluation, while the remaining 80% is used for training.
random_state = 1
This parameter ensures reproducibility by seeding the random number generator. The same random seed guarantees that the data split remains consistent across multiple runs of the code.
stratify = y
This parameter ensures that the class distribution in the original dataset is preserved in both the training and testing sets. Stratified sampling is crucial when dealing with imbalanced datasets, where certain classes may be underrepresented.
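The effect of these three parameters can be checked on synthetic data. The sketch below uses randomly generated features and an invented 70/20/10 class distribution, not the Edge-IIoTset itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 4))                       # 100 synthetic samples, 4 features
y = np.array([0] * 70 + [1] * 20 + [2] * 10)   # imbalanced synthetic labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Per-class counts in each subset keep the original 70/20/10 proportions
print(np.bincount(y_train))
print(np.bincount(y_test))
```

Without `stratify`, a rare class could by chance be missing from the test set entirely, which would make its detection rate impossible to evaluate.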
4) Label Encoding
Next, we used the LabelEncoder class from the sklearn.preprocessing module to encode the categorical values in the dataset as integers. This transformation is necessary because deep learning algorithms require numerical input.
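As a small illustration of what LabelEncoder produces, the sketch below encodes a handful of class names. The list is invented for the example; it only reuses class names mentioned in this paper.

```python
from sklearn.preprocessing import LabelEncoder

raw_labels = ["Normal", "XSS", "Fingerprinting", "Normal", "Port_Scanning"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)

# Classes are sorted alphabetically and mapped to 0..n_classes-1
print(list(encoder.classes_))
print(encoded.tolist())

# The mapping is reversible, which is useful when reporting predictions
decoded = encoder.inverse_transform(encoded)
```

The integer codes carry no ordering information about the classes; they are only identifiers.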
5) Min-Max Scaling
In this stage, the Min-Max scaling technique is applied to normalize the feature values in the dataset. Min-Max scaling is a preprocessing method that transforms numerical features to a specific range, typically between 0 and 1. This is achieved by adjusting the values proportionally based on the minimum and maximum values observed in the original dataset.
The `MinMaxScaler` from scikit-learn is employed, creating an instance called `min_max_scaler`. The `fit_transform` method is then used on the training set (`X_train`), which calculates the minimum and maximum values for each feature in the training data and scales the values accordingly. The same scaler is then applied to the test set (`X_test`) using the `transform` method, ensuring that both the training and test sets are scaled consistently.
Min-Max scaling is particularly beneficial for machine learning models, especially those based on distance metrics or optimization algorithms, as it helps prevent features with larger scales from dominating the learning process. By bringing all features within a uniform numeric range, Min-Max scaling contributes to improved model convergence and performance.
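For each feature, Min-Max scaling computes x' = (x - min) / (max - min), with min and max taken from the training data. The sketch below applies it to small invented arrays to show the fit-on-train, transform-on-test pattern.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: two features, three training rows, one test row
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[2.5, 40.0]])

min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)  # learns min and max per feature
X_test_scaled = min_max_scaler.transform(X_test)        # reuses the training min and max

print(X_train_scaled)
print(X_test_scaled)
```

Since the test set is scaled with the training statistics, a test value outside the training range (here 40.0 in the second feature) is mapped outside [0, 1].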
6) One-Hot Encoding for Labels
In this stage, one-hot encoding is applied to transform categorical labels into a binary vector representation, enhancing the model's ability to understand and learn from multiple classes efficiently. This technique is particularly essential when dealing with classification tasks involving more than two categories.
The to_categorical function from the Keras utilities in TensorFlow is used for this purpose. It converts integer class labels into their one-hot encoded equivalents. This binary representation facilitates model convergence and performance, especially when the classes are not inherently ordered and their relationships are categorical rather than ordinal.
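To make the transformation concrete, the sketch below reproduces the behaviour of one-hot encoding with NumPy only, so that it runs without TensorFlow. The helper name one_hot is hypothetical; with Keras, the equivalent call would be `to_categorical(y_int, num_classes=3)`.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Map integer labels to one-hot rows (hypothetical NumPy helper)."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]


y_int = [0, 2, 1, 2]
y_one_hot = one_hot(y_int, num_classes=3)
print(y_one_hot)
```

For the 15 classes of our dataset, each label would become a vector of length 15 with a single 1.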
C. Deep learning
• CNN 1
First, we created a CNN model that takes an input_shape parameter describing the shape of the input data, which has 92 features. The model is built layer by layer. It applies two 1D convolutional layers with 32 and 64 filters respectively, each using a ReLU activation function and padding that preserves the input size. Each convolutional layer is followed by max-pooling to reduce the spatial dimensions. The output is then flattened to feed the fully connected part of the network, which consists of four dense layers with progressively decreasing numbers of units (512, 256, 128, 64) and ReLU activation functions. The final output layer has 15 units with a softmax activation function for multi-class classification. The model is trained using categorical cross-entropy loss and the Adam optimizer. Overall, this type of architecture is well suited to sequential data, such as time series or signals.
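A possible Keras rendering of this description is sketched below. Several details are not stated in the text and are assumptions here: the input is reshaped to (92, 1), the kernel size is 3, and the pooling size is 2. The function name build_cnn1 is also hypothetical.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn1(input_shape=(92, 1), num_classes=15):
    """Sketch of CNN 1; kernel_size=3 and pool_size=2 are assumed values."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


cnn1 = build_cnn1()
cnn1.summary()
```

Under these assumptions, the two pooling steps reduce the sequence length from 92 to 46 and then to 23 before flattening.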
• CNN 2
We also created an alternative CNN model. Given an input shape, it employs two convolutional layers with 64 and 128 filters, using ReLU activation functions followed by Leaky ReLU activations with an alpha value of 0.1. Max-pooling operations reduce the spatial dimensions, and the output is flattened for the fully connected layers. Three dense layers with 256, 128, and 64 units, each with Leaky ReLU activations, follow. The final layer uses a softmax activation with 15 units for multi-class classification. The model is trained with categorical cross-entropy loss and the Adam optimizer. This architecture is designed to capture intricate patterns in sequential data, as in time series analysis or signal processing.
• DNN 1
The first DNN model starts with an input layer that takes data with 92 features. The network then contains four hidden layers with 512, 256, 128, and 64 units respectively, all using ReLU activation functions. Finally, an output layer with 15 units uses a softmax activation for multi-class classification. This architecture maps data with 92 features to predictions across 15 different classes.
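To give a sense of the size of this network, the short calculation below counts its trainable parameters from the layer sizes listed above, assuming standard fully connected layers with one bias per unit. The helper name dense_param_count is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# Input, four hidden layers, and output of DNN 1
dnn1_sizes = [92, 512, 256, 128, 64, 15]
print(dense_param_count(dnn1_sizes))  # 221071
```

Most of the parameters sit in the first two weight matrices (92×512 and 512×256).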
• DNN 2
The second model is structured with eight dense layers. The first layer takes input with 92 features and uses a ReLU activation. Subsequent layers contain 512, 256, 128, 64, and 32 units, all with ReLU activations. The final output layer has 15 units with a sigmoid activation.
• DNN 3
The third DNN model takes the same input and has the same architecture as the second DNN, with one additional layer of 1028 units placed after the input layer.
Finally, the model is compiled with the "binary_crossentropy" loss, which is commonly used for binary classification problems. The Adam optimizer is used to update the model's weights and biases during training, and performance is evaluated using the accuracy metric.
D. Class weights
Class imbalance is a common difficulty in classification tasks and frequently impedes accurate predictions by deep learning algorithms. It arises when one class substantially dominates the others in terms of data samples, which skews predictions towards that class. Class weights are one method for addressing it: by giving minority classes greater weights, they lessen the bias towards the majority class and make the model pay more attention to minority-class patterns.
Accordingly, we chose this technique to deal with our imbalanced dataset. We gave the minority classes higher weights so that our models can detect them more easily during training; examples are “Fingerprinting”, “XSS”, and “Port_Scanning”, which are the least represented classes in our dataset.
In this stage, we assigned a weight to each class. The weights are calculated using scikit-learn's compute_class_weight function with the class_weight parameter set to "balanced". Their purpose is to compensate for imbalances in the dataset when some classes have fewer examples than others. By assigning higher weights to underrepresented classes and lower weights to overrepresented classes, training gives more importance to the less common classes. This helps the model learn from imbalanced data and is intended to improve its overall performance and fairness across categories.
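With the "balanced" setting, scikit-learn computes the weight of class c as n_samples / (n_classes × n_c), where n_c is the number of samples in class c. The sketch below shows this on invented labels with a 70/25/5 distribution; the dictionary form at the end is the format accepted by the class_weight argument of Keras' fit method, and whether it matches our exact training call is an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels: 70, 25 and 5 samples for classes 0, 1 and 2
y_train = np.array([0] * 70 + [1] * 25 + [2] * 5)
classes = np.unique(y_train)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)

# Same values from the formula n_samples / (n_classes * count_per_class)
manual = len(y_train) / (len(classes) * np.bincount(y_train))

# Dictionary form: class index -> weight
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

The rarest class receives the largest weight, so each of its samples contributes more to the loss during training.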
E. Bagging
In our approach, we use ensemble learning in the form of Bagging (B), combining Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs). We first draw five (N = 5) random subsets (D1, D2, D3, D4, D5), denoted {Di}, from the dataset (D) using bootstrapping, and each subset is used to train an individual model (M1, M2, M3, M4, M5). We then aggregate the predictions (Pi) of these models using a majority voting mechanism (MV), which selects the most frequently predicted class (C*) in order to enhance the accuracy of our intrusion detection system (IDS):
IDS = MV(P1, P2, P3, P4, P5) = C*
where C* is the class selected by the majority vote.
• Bootstrapping
Our code generates bootstrapped samples for training multiple models. Bootstrapping is a resampling technique in which random samples, known as bootstrap samples, are drawn with replacement from the original dataset. These samples are then used to train individual models, enabling an ensemble approach to machine learning.
The goal is to create one bootstrapped sample for each of a specified number of models (num_models). The size of each sample is given by the variable sample_size, which is set to 80% of the total number of data points (num_data_points). A loop iterates over the models to be trained, and at each iteration a random set of indices is drawn with replacement using NumPy's np.random.choice function. These indices identify the data points included in the current bootstrapped sample.
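The procedure just described can be sketched as follows. The variable names num_models, sample_size, and num_data_points follow the text; the toy arrays, the seed, and the list bootstrap_samples are assumptions added for the example.

```python
import numpy as np

np.random.seed(1)  # assumed seed, only to make the sketch repeatable

# Toy training data: 10 samples with 2 features each
X_train = np.arange(20).reshape(10, 2)
y_train = np.arange(10)

num_models = 5
num_data_points = len(X_train)
sample_size = int(0.8 * num_data_points)  # 80% of the data points

bootstrap_samples = []
for _ in range(num_models):
    # Indices drawn with replacement, so a data point can appear more than once
    indices = np.random.choice(num_data_points, size=sample_size, replace=True)
    bootstrap_samples.append((X_train[indices], y_train[indices]))

for X_boot, y_boot in bootstrap_samples:
    print(X_boot.shape, sorted(y_boot.tolist()))
```

Each model in the ensemble would then be trained on its own (X_boot, y_boot) pair.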
• Majority Voting
The Majority Voting function uses the Counter class to tally the occurrences of each class in a given list of predictions. Using `Vote_counts.most_common`, it identifies and returns the class with the highest count, which represents the majority vote. The final prediction is therefore the class most frequently predicted by the contributing models, consolidating their diverse predictions into a single, robust outcome.
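A minimal version of such a voting function is sketched below. The function name majority_vote and the example predictions are assumptions made for illustration; only the use of Counter and most_common comes from the description above.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted most often for one sample."""
    vote_counts = Counter(predictions)
    # most_common(1) returns a list holding the single (class, count) pair
    return vote_counts.most_common(1)[0][0]


# Invented predictions of five models for two samples
per_sample_predictions = [
    ["Normal", "Normal", "XSS", "Normal", "Normal"],
    ["XSS", "XSS", "Normal", "XSS", "Fingerprinting"],
]
final_predictions = [majority_vote(p) for p in per_sample_predictions]
print(final_predictions)
```

With five models a tie is still possible (for example two votes each for two classes); in that case `most_common` returns the tied class that appears first in the list.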