In computer systems, especially with the advancement of the Internet and databases, big data is expanding exponentially. [1] Technology has evolved rapidly in the last few decades, and along with its advantages it also brings security threats. Modern networks and the Internet are protected in order to prevent intrusions by hackers and other cyber-attacks. To secure modern networks, various cyber security methods and protection systems, such as firewalls, authentication techniques, encryption methods and intrusion-detection systems (IDS), have been introduced to protect networks and monitor their traffic.
Intrusion detection is an important issue in network security. Although significant progress has been made in this field, there are still many opportunities to improve detection methods and prevent network-based attacks.
Because cyber-attacks are becoming more dangerous and harder to detect, classifying the types of attacks and recognizing their patterns is a vital step in network security frameworks.
Effective identification of cyber-attacks is considered a fundamental challenge for network users and managers, especially in modern networks that are evolving rapidly with the advancement of technology. Therefore, to improve the accuracy of intrusion detection and to defend against more attacks, many advanced techniques from machine learning, automated machine learning and data mining can be used.
In recent years, various artificial intelligence algorithms and methods have been used in the design of IDS, including several types of fuzzy methods and genetic algorithms, among others. Since speed and accuracy are very important in IDS design, the combination of a genetic algorithm and fuzzy logic seems to be a suitable option for this problem. The scope of application of genetic algorithms is very wide, and with the progress of science and technology their use in optimization and problem solving expands every day.
Machine learning, drawing on information theory, builds mathematical models that can be used for inference. [2] Machine learning techniques are suitable in situations where there is no prior knowledge about data patterns, which is why these methods are sometimes called bottom-up. An important advantage is that experts are usually not needed to specify the requirements for detecting intrusions, so these techniques work quickly and are affordable. Machine learning techniques are generally divided into two categories, supervised and unsupervised. [3] With today's intrusion-detection solutions, designers of intrusion-detection systems face difficulties in choosing the architecture that detects intrusions most reliably. In fact, intrusion detection is one of the most important issues in the security field. For this reason, the need to find the fastest and most accurate solution for intrusion detection has prompted researchers to conduct extensive research in this area. [4]
Identifying malicious network intrusions has been a subject of study for decades. However, as data scientists know, when the scale of a problem increases by an order of magnitude, existing approaches are often no longer effective. [5, 6] The problem becomes so different that it needs a new solution, and since the volume of network traffic is increasing day by day, the field of intrusion detection is forced to reinvent itself around big data techniques. [7] An intrusion-detection system monitors networks or other systems for malicious or unusual behavior. By complementing preventive technologies such as firewalls, strong authentication and user privileges [8], intrusion-detection systems have become an essential part of organizational information technology security management [9]. These systems are usually classified into two categories, misuse-based and anomaly-based systems. [10] Data mining techniques are increasingly used to detect attacks, anomalies or intrusions in a protected network environment. [11]
In this study, the proposed method has two stages. In the first stage, the data set is pre-processed by normalizing it and converting it to numeric form, removing outliers based on two PCA methods, and reducing the feature dimensions before the learner is applied. We use the k-means algorithm to cluster the data and the Elbow method to find the optimal number of clusters. The second stage consists of separating malicious from normal network traffic by combining the K-means and XGBoost algorithms on computing platforms. The main structure of the paper is as follows.
Related work
The subject of research [12] is the use of process mining in host-based intrusion-detection systems. In this article, the authors state that many organizations have moved from data-oriented to process-oriented systems and use process-based information systems to improve their efficiency. The proposed approach includes pre-processing stages, two parallel stages of anomaly detection and misuse detection, and a combination of the results.
In the study [13], a modern system for feature selection and alarm management that is capable of active execution was simulated. The obtained results compare the new system with other alarm management methods and show that its speed, accuracy and efficiency are much higher than those of other clustering-based alarm management methods. Additionally, according to the obtained results, this system is able to actively classify the alarms of intrusion-detection systems. The innovation of this research is the use of all the existing classification algorithms that are available in the Weka software and the proposal of five data samples that are extracted from the primary data and give the best answer for different models and algorithms. The authors describe research aimed at determining the status of the thyroid gland in terms of normality, hyperthyroidism or hypothyroidism using data mining methods. The predictive model for the classification of thyroid disease was built after data pre-processing using supervised and unsupervised machine learning methods. This study is analytical, and its database contains 215 independent records with five continuous features, collected from the UCI Machine Learning Repository. [14] The aim of this research is to reduce the error of thyroid disease diagnosis, and the use of data mining methods helps to reduce this error. In addition, in this article, thyroid disease is diagnosed with the help of different pattern recognition methods. The results show that the fuzzy neural model has the lowest error and the highest accuracy.
The research [15] aims to increase the accuracy of intrusion detection in the Internet of Things (IoT) using a support vector machine (SVM) improved by the grasshopper optimization algorithm. In this research, the authors first collect the data of the intrusion-detection system in the Internet of Things. After the data are cleaned using central average statistics, linear normalization is performed, and feature selection with Fisher's discriminant analysis algorithm selects five of the 41 features. Then, the support vector machine is improved using the grasshopper optimization algorithm, and the results are compared with those of the bagging classifier and the k-nearest neighbor classifier. The experimental results using simulators and training data show that the proposed model performs better than the bagging and k-nearest neighbor classifiers in terms of statistical error analysis. Additionally, a comparison with SVM improved by gray wolf optimization and by particle swarm optimization shows the better accuracy of this method. In addition, because it uses algorithms of low complexity, low resource consumption and higher speed, it is a suitable method for detecting attacks in IoT.
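The pre-processing steps mentioned for [15], linear normalization followed by Fisher-based feature selection, can be illustrated with a minimal sketch. It assumes min-max scaling as the linear normalization and the per-feature Fisher score (between-class scatter divided by within-class scatter) as the selection criterion; the cited work may use a different formulation. The function names and the toy records are hypothetical and are not taken from [15].

```python
def min_max_normalize(rows):
    """Linearly rescale every feature (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def fisher_scores(rows, labels):
    """Fisher score per feature: between-class over within-class scatter."""
    classes = sorted(set(labels))
    scores = []
    for col in zip(*rows):
        overall_mean = sum(col) / len(col)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, y in zip(col, labels) if y == c]
            mean_c = sum(values) / len(values)
            var_c = sum((v - mean_c) ** 2 for v in values) / len(values)
            between += len(values) * (mean_c - overall_mean) ** 2
            within += len(values) * var_c
        scores.append(between / within if within > 0 else float("inf"))
    return scores


def select_top_features(rows, labels, k):
    """Return the indices of the k features with the highest Fisher score."""
    scores = fisher_scores(rows, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy records: three features, two classes (0 = normal, 1 = attack).
records = [
    [1.0, 200.0, 5.0],
    [2.0, 100.0, 5.5],
    [9.0, 210.0, 5.2],
    [10.0, 90.0, 5.4],
]
classes = [0, 0, 1, 1]

normalized = min_max_normalize(records)
selected = select_top_features(normalized, classes, k=2)
```

In this toy example only the first feature separates the two classes clearly, so it receives the highest score. In the setting described for [15], the same ranking would be applied to 41 features with k = 5.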
In the study [16], an IoT intrusion-detection model based on the light gradient boosting machine is proposed. In the first step, a one-dimensional convolutional neural network is used to extract features from network traffic and reduce the feature dimensions. Then, the light gradient boosting machine is used for classification to detect the type of network traffic. While inheriting the advantages of the gradient boosting tree, the light gradient boosting machine is lighter and builds its decision trees faster. Experiments conducted on the TON-IoT and BoT-IoT data sets show that the proposed model performs better and is lighter than the comparison models, and that it can shorten the prediction time by 90.66%. The experimental results on a test platform built with IoT devices such as the Raspberry Pi show that the proposed model can perform effective, real-time intrusion detection on IoT devices.
In the research conducted in [17], an intrusion-detection model is presented that combines chi-square feature selection with a multi-class SVM. Many approaches use only one algorithm to classify network traffic as normal or abnormal. Due to the large volume of data, such a classifier does not succeed in achieving a high attack detection rate and a low false alarm rate. In this solution, however, by reducing the dimensions of the data, the authors obtained an optimal set of features without losing information, and then, using the multi-class modeling method, they identified and classified different network attacks.
In research [18], the authors present an intrusion-detection system based on the integration of cluster centers and nearest neighbors, and on this basis they propose a feature representation method called CANN. In this method, two distances are measured and added. The first is the distance between each data sample and its cluster center. The second is the distance between each data sample and its nearest neighbor. The sum of these two distances is then used to represent each data sample in the KNN algorithm. The results of implementing and evaluating this method on the KDD Cup 99 data set show that it is more accurate than the simple KNN method.
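The two-distance representation described for CANN can be sketched as follows. This is a simplified illustration of the description above, not the implementation of [18]: it assumes Euclidean distance, uses per-class means as a stand-in for the cluster centers that a clustering step would produce, and assigns each sample to its closest center. All function names and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def cann_feature(sample, centers, neighbours):
    """One-dimensional feature: distance to the closest cluster center
    plus distance to the nearest neighbour among `neighbours`."""
    d_center = min(euclidean(sample, c) for c in centers)
    d_neighbour = min(euclidean(sample, n) for n in neighbours)
    return d_center + d_neighbour


def knn_predict(value, train_values, train_labels, k=3):
    """Majority vote among the k training features closest to `value`."""
    ranked = sorted(zip(train_values, train_labels),
                    key=lambda pair: abs(pair[0] - value))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)


# Hypothetical toy data: a tight "normal" group and a spread-out "attack" group.
normal = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
attack = [(10.0, 10.0), (10.0, 14.0), (14.0, 10.0), (14.0, 14.0)]
train = normal + attack
labels = ["normal"] * len(normal) + ["attack"] * len(attack)

# Assumption: class means stand in for cluster centers found by clustering.
centers = [mean_point(normal), mean_point(attack)]

# For a training sample, the nearest neighbour excludes the sample itself.
train_features = [
    cann_feature(s, centers, train[:i] + train[i + 1:])
    for i, s in enumerate(train)
]


def classify(sample):
    """Map a new sample to its CANN feature, then classify it with KNN."""
    feature = cann_feature(sample, centers, train)
    return knn_predict(feature, train_features, labels)
```

Because every sample is reduced to a single number, the KNN step operates in one dimension. Two groups with the same internal geometry would receive identical features under this simplified form, which is why the toy groups above differ in spread.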
In the research [19], a new technique is introduced to improve the intrusion-detection process by managing the complexities of big data related to different forms of heterogeneous security data. To achieve the former objective, an SVM ensemble is integrated with a CGO algorithm.
The proposed method improves the intrusion classification accuracy and also identifies nine different types of attacks in the UNSW-NB15 data set. Its efficiency is evaluated using statistical analysis and various performance measures such as precision, recall, F1 score, ROC curve and confusion matrix, by comparing it with baseline models. The proposed method achieves 96.29% accuracy compared to 89.12% for chip-SVM, and a 6.47% improvement in accuracy over chip-SVM is reported. The higher classification accuracy indicates that the method produces fewer false positives when handling security events in big data platforms.