3.1. Data mining technology
Generally, data mining consists of three main stages: preprocessing the raw data, mining the prepared data, and displaying or storing the mined knowledge. The specific steps of a data mining operation are as follows (a minimal code sketch of steps (1) and (4) follows the list):
(1) Data processing: The original data set may contain noisy, incomplete, or inconsistent data. Therefore, the noisy data in the original data set must be repaired, and incomplete records must be completed or removed.
(2) Data aggregation: Data with various formats and characteristics are stored together in a database or data warehouse file so that users can easily retrieve the data they need. Generally, data aggregation can effectively improve the efficiency of data sharing.
(3) Data selection: Statistical learning methods are used to extract and analyze the data relevant to the task from the data warehouse, which not only increases the data utilization rate but also facilitates the subsequent data mining.
(4) Data transformation: Because the measurement scales of the various parts of the data set differ, the data must be standardized to obtain a format suitable for the data mining process.
(5) Data mining: Data mining is the process of continuously optimizing toward the problem target. It must meet the needs of users, which means it must uncover content that is useful or interesting to them.
(6) Model evaluation: Starting from the target solution of the problem, the corresponding evaluation indicators are used to assess the results and ensure the accuracy of the process.
(7) Knowledge representation: Data visualization techniques can be used to present the acquired knowledge to users, or the newly acquired knowledge can be stored in a shared resource pool for use by other applications.
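As a hedged illustration of steps (1) and (4), the following Python sketch cleans a toy data set and rescales its features; the column names, the outlier threshold, and the pandas/scikit-learn tooling are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data with a noisy outlier and a missing value.
raw = pd.DataFrame({
    "duration": [1.2, 2.5, None, 900.0, 3.1],  # 900.0 is a noisy outlier
    "bytes":    [320, 150, 480, 210, 290],
})

# Step (1) data processing: drop incomplete rows, clip noisy values.
clean = raw.dropna()
clean = clean[clean["duration"] < clean["duration"].quantile(0.99)]

# Step (4) data transformation: rescale every feature to [0, 1]
# so that differing measurement scales do not dominate the mining.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(clean),
                      columns=clean.columns)
print(scaled)
```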
The main process of data mining is shown in Fig. 1:
In intrusion detection, data mining technology is used to mine the valuable information and knowledge hidden in a large amount of audit security data, distill the acquired knowledge into understandable rules or patterns, and then use those rules or patterns to effectively detect any activity suspected of attacking the network.
(1) Association rules. This is a technique for searching for related data items across multiple data sets. Association rule mining is usually divided into two steps to find the relationships between data objects: the first step searches for all frequent itemsets, that is, itemsets that satisfy the minimum support threshold; the second step derives the rules from those frequent itemsets. The itemset-search step usually dominates the computation, and processing large data sets consumes substantial computing resources. Association rules can be expressed as IF-THEN rules derived by unsupervised learning, and the Apriori and FP-Growth algorithms are the association rule techniques most widely used in recommendation and market-basket analysis systems. If only traditional association rule techniques are used to determine whether a large volume of network data contains suspicious activities or behaviors, erroneous results are produced while searching for items of equal importance, and a great deal of time and resources are wasted. Before association rules can be used to detect network attacks, their lack of analytical capability must be effectively resolved. (A brute-force sketch of the two mining steps is given after this list.)
(2) Sequential patterns. This technique, like association rules, is used to find links between data. However, the focus differs: association rules focus on determining the relationships between different data objects within a data set, while sequential pattern mining focuses on finding the ordering and temporal relationships between the records of a data set.
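As a hedged sketch of the two steps of association rule mining, the following Python fragment enumerates frequent itemsets by brute force and derives IF-THEN rules; the toy transactions and the minimum support threshold are invented for illustration.

```python
from itertools import combinations

# Hypothetical transactions (e.g., event types seen in audit records).
transactions = [
    {"login", "scan", "ftp"},
    {"login", "scan"},
    {"login", "ftp"},
    {"scan", "ftp"},
]
min_support = 0.5  # assumed minimum support threshold

# Step 1: find all frequent itemsets (support >= min_support).
items = sorted({i for t in transactions for i in t})
frequent = {}
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        support = sum(set(cand) <= t for t in transactions) / len(transactions)
        if support >= min_support:
            frequent[cand] = support

# Step 2: derive IF-THEN rules A -> B with confidence sup(A∪B)/sup(A).
for itemset, sup in frequent.items():
    for k in range(1, len(itemset)):
        for lhs in combinations(itemset, k):
            if lhs in frequent:
                rhs = tuple(i for i in itemset if i not in lhs)
                print(f"IF {lhs} THEN {rhs} (conf={sup / frequent[lhs]:.2f})")
```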
Given a data set S = {x1, x2, ..., xn} with n data points, the FCM algorithm divides the n points into c clusters (2 ≤ c < n) by minimizing an objective function. The objective function of the algorithm is:
$$\begin{array}{c}\min J(U,V)=\sum _{i=1}^{c}\sum _{j=1}^{n}{u}_{ij}^{m}{\parallel {\text{x}}_{j}-{\text{v}}_{i}\parallel }^{2}\\ \text{s.t.}\quad \sum _{i=1}^{c}{u}_{ij}=1,\ 1\le j\le n\\ \sum _{j=1}^{n}{u}_{ij}>0,\ 1\le i\le c\\ {u}_{ij}>0,\ 1\le i\le c,\ 1\le j\le n\end{array}$$
1
The FCM algorithm finds the optimal cluster centers by minimizing the objective function J. Since the minimization is subject to constraints, the method of Lagrange multipliers is used to construct a Lagrangian objective function that combines the constraints with the optimization target; it is expressed as:
$$F=\sum _{i=1}^{c}\sum _{j=1}^{n}{u}_{ij}^{m}{\parallel {\text{x}}_{j}-{\text{v}}_{i}\parallel }^{2}+\alpha (\sum _{i=1}^{c}{u}_{ij}-1)$$
2
Among them, the Lagrange multiplier is represented by α. The partial derivatives with respect to the membership degree uij and the cluster center vi are then calculated as follows:
$$\frac{\partial F}{\partial {u}_{ij}}=m{\left({u}_{ij}\right)}^{m-1}{\parallel {\text{x}}_{j}-{\text{v}}_{i}\parallel }^{2}+\alpha$$
3
$$\frac{\partial F}{\partial {\text{v}}_{i}}=-2\sum _{j=1}^{n}{u}_{ij}^{m}({\text{x}}_{j}-{\text{v}}_{i})$$
4
Setting \(\frac{\partial F}{\partial {u}_{ij}}\)=0 and \(\frac{\partial F}{\partial {\text{v}}_{i}}\)=0, and using the constraint \(\sum _{i=1}^{c}{u}_{ij}=1\), the calculation formulas for the membership degree uij and the cluster center vi are obtained, as in formulas (5) and (6):
$${u}_{ij}=\frac{1}{\sum _{k=1}^{c}{\left(\frac{\parallel {\text{x}}_{j}-{\text{v}}_{i}\parallel }{\parallel {\text{x}}_{j}-{\text{v}}_{k}\parallel }\right)}^{\frac{2}{m-1}}}$$
5
$${v}_{i}=\frac{\sum _{j=1}^{n}{u}_{ij}^{m}{\text{x}}_{j}}{\sum _{j=1}^{n}{u}_{ij}^{m}}$$
6
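A minimal NumPy sketch of the alternating updates of formulas (5) and (6); the defaults (c = 3, m = 2, tolerance) and the random initialization are assumptions mirroring the conventional algorithm.

```python
import numpy as np

def fcm(X, c=3, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Standard FCM: alternate the membership update, formula (5),
    and the center update, formula (6), until the centers converge."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    for _ in range(max_iter):
        # Squared distances ||x_j - v_i||^2, shape (c, n); a small epsilon
        # avoids division by zero when a point coincides with a center.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1) + 1e-12
        # Formula (5): (d2_i/d2_k)^(1/(m-1)) equals (d_i/d_k)^(2/(m-1)).
        U = 1.0 / ((d2[:, None, :] / d2[None, :, :]) ** (1 / (m - 1))).sum(axis=1)
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)  # formula (6)
        if np.abs(V_new - V).max() < tol:
            return U, V_new
        V = V_new
    return U, V

# Example: cluster 150 random 4-D points into 3 groups.
U, V = fcm(np.random.rand(150, 4))
```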
(1) Euclidean distance
The Euclidean distance is also called the L2 norm. It is the most commonly used distance measure in cluster analysis. The Euclidean distance formula is given in (7):
$$d({\text{x}}_{i},{\text{x}}_{j})={\parallel {\text{x}}_{i}-{\text{x}}_{j}\parallel }_{2}=\sqrt{\sum _{k=1}^{q}{({x}_{ik}-{x}_{jk})}^{2}}$$
7
(2) Manhattan distance
The Manhattan distance is also known as the L1 norm. It sums the absolute differences between the corresponding features of two data points. The Manhattan distance formula is given in (8):
$$d({\text{x}}_{i},{\text{x}}_{j})=\sum _{k=1}^{q}|{x}_{ik}-{x}_{jk}|$$
8
(3) Chebyshev distance
The Chebyshev distance is also called the L∞ metric; it is the maximum absolute difference between the corresponding features of two data points. The Chebyshev distance formula is given in (9):
$$d({\text{x}}_{i},{\text{x}}_{j})=\underset{k}{max}\left(\right|{x}_{ik}-{x}_{jk}\left|\right)$$
9
(4) Minkowski distance
The Minkowski distance has a concise and flexible form, but it is independent of the data distribution, which gives it certain limitations. The Minkowski distance formula is given in (10):
$$d({\text{x}}_{i},{\text{x}}_{j})={\left(\sum _{k=1}^{q}{|{x}_{ik}-{x}_{jk}|}^{r}\right)}^{\frac{1}{r}}$$
10
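The four metrics of formulas (7) through (10) in compact NumPy form; the two sample vectors and the Minkowski order r are illustrative.

```python
import numpy as np

xi = np.array([5.1, 3.5, 1.4, 0.2])
xj = np.array([6.2, 2.9, 4.3, 1.3])

euclidean = np.sqrt(((xi - xj) ** 2).sum())          # formula (7), L2 norm
manhattan = np.abs(xi - xj).sum()                    # formula (8), L1 norm
chebyshev = np.abs(xi - xj).max()                    # formula (9), L-infinity
r = 3                                                # assumed Minkowski order
minkowski = (np.abs(xi - xj) ** r).sum() ** (1 / r)  # formula (10)
```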
The information gain ratio is calculated as follows. The data set S contains P classes yp (1 ≤ p ≤ P); |S| denotes its size and |yp| the number of samples belonging to class yp. Suppose the feature Ak (1 ≤ k ≤ q) takes L distinct values. According to the value of Ak, the data set is partitioned into L subsets {S1, S2, ..., SL}, where |Sl| (1 ≤ l ≤ L) is the size of Sl. The subset of Sl belonging to class yp is Slp, and |Slp| denotes its size. The empirical entropy of the data set S is given in formula (11):
$$H\left(S\right)=-\sum _{p=1}^{P}\frac{\left|{y}_{p}\right|}{\left|S\right|}{\log }_{2}\frac{\left|{y}_{p}\right|}{\left|S\right|}$$
11
Formula (12) gives the empirical conditional entropy of the data set S with respect to the feature Ak:
$$H(S\mid {A}_{k})=\sum _{l=1}^{L}\frac{\left|{S}_{l}\right|}{\left|S\right|}H\left({S}_{l}\right)=-\sum _{l=1}^{L}\frac{\left|{S}_{l}\right|}{\left|S\right|}\sum _{p=1}^{P}\frac{\left|{S}_{lp}\right|}{\left|{S}_{l}\right|}{\log }_{2}\frac{\left|{S}_{lp}\right|}{\left|{S}_{l}\right|}$$
12
Formula (13) gives the information gain of the feature Ak:
$$\mathrm{gain}(S,{A}_{k})=H\left(S\right)-H(S\mid {A}_{k})$$
13
Formula (14) gives the information entropy of the feature Ak:
$$H\left({A}_{k}\right)=-\sum _{l=1}^{L}\frac{\left|{S}_{l}\right|}{\left|S\right|}{\log }_{2}\frac{\left|{S}_{l}\right|}{\left|S\right|}$$
14
Formula (15) gives the information gain ratio of the feature Ak:
$${\mathrm{gain\_ratio}}_{k}=\frac{\mathrm{gain}(S,{A}_{k})}{H\left({A}_{k}\right)}$$
15
Introducing the information gain ratio of each feature into the Euclidean distance formula as a weight yields a new distance formula:
$$d({x}_{i},{x}_{j})=\sqrt{\sum _{k=1}^{q}{\mathrm{gain\_ratio}}_{k}\times {({x}_{ik}-{x}_{jk})}^{2}}$$
16
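A sketch of formulas (11) through (16): the gain ratio of each discretized feature is computed and then used as the per-feature weight in the distance of formula (16). The toy feature and label arrays are assumptions; real usage would first discretize continuous features, a step the text does not detail.

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(S), formula (11)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    """Information gain ratio of one feature, formulas (12)-(15)."""
    vals, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    cond = sum(w * entropy(labels[feature == v])     # formula (12)
               for v, w in zip(vals, weights))
    gain = entropy(labels) - cond                    # formula (13)
    split = -(weights * np.log2(weights)).sum()      # formula (14)
    return gain / split                              # formula (15)

def weighted_euclidean(xi, xj, ratios):
    """Gain-ratio-weighted Euclidean distance, formula (16)."""
    return np.sqrt((ratios * (xi - xj) ** 2).sum())

# Toy example: one discretized feature and binary class labels.
f = np.array(["low", "low", "high", "high", "high"])
y = np.array([0, 0, 1, 1, 0])
print(gain_ratio(f, y))
```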
In-depth study of the existing FCM algorithm shows that, when the membership degree of each cluster is updated, the differences in importance between features are not considered. Therefore, the improved FCM algorithm uses the weighted Euclidean distance when minimizing the objective function J:
$$\begin{array}{c}\min J(U,V)=\sum _{i=1}^{c}\sum _{j=1}^{n}{u}_{ij}^{m}\,d({\text{x}}_{j},{\text{v}}_{i})\\ \text{s.t.}\quad \sum _{i=1}^{c}{u}_{ij}=1,\ 1\le j\le n\\ \sum _{j=1}^{n}{u}_{ij}>0,\ 1\le i\le c\\ {u}_{ij}>0,\ 1\le i\le c,\ 1\le j\le n\end{array}$$
17
Using the Lagrange multiplier method, new calculation formulas for the membership degree uij and the cluster center vi are obtained, as shown in formulas (18) and (19):
$${u}_{ij}=\frac{1}{\sum _{k=1}^{c}{\left(\frac{d({\text{x}}_{j},{\text{v}}_{i})}{d({\text{x}}_{j},{\text{v}}_{k})}\right)}^{\frac{1}{m-1}}}$$
18
$${v}_{i}=\frac{\sum _{j=1}^{n}{u}_{ij}^{m}{\text{x}}_{j}}{\sum _{j=1}^{n}{u}_{ij}^{m}}$$
19
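Under the same assumptions as the earlier FCM sketch, one iteration of the improved updates (18) and (19) with the weighted distance of formula (16) might look like this; ratios holds the per-feature gain ratios.

```python
import numpy as np

def fcm_update_weighted(X, V, ratios, m=2.0):
    """One improved-FCM iteration: the weighted distance of formula (16)
    feeds the membership update (18) and the center update (19)."""
    d = np.sqrt((((X[None, :, :] - V[:, None, :]) ** 2) * ratios).sum(-1)) + 1e-12
    U = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (1 / (m - 1))).sum(axis=1)
    Um = U ** m
    V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
    return U, V_new
```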
The existing FCM algorithm must use the cluster centers when updating the membership degree of each data point, but the initial cluster centers are selected with a certain degree of randomness. A more effective set of initial cluster centers can improve computational efficiency and reduce running time, thereby saving computing resources. Therefore, this article uses the density method to select a good set of initial cluster centers. First, the average distance md over all data and the density ρi of each data point xi are calculated, as shown in formulas (20) and (21):
$$md=\frac{2}{n(n-1)}\sum _{i=1}^{n}\sum _{j=i+1}^{n}d({\text{x}}_{i},{\text{x}}_{j})$$
20
$${\rho }_{i}=\sum _{j=1}^{n}\text{e}\text{x}\text{p}(-{\left(\frac{d({\text{x}}_{i},{\text{x}}_{j})}{md}\right)}^{2})$$
21
Here 1 ≤ i < j ≤ n. Since the distance function d(·) is symmetric, the index j only needs to start from i + 1, which saves calculation time and reduces the load on the computer.
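A sketch of the density-based initialization of formulas (20) and (21). The final selection rule, taking the densest points while keeping the chosen centers at least md apart, is an assumption, since the text specifies only the density computation.

```python
import numpy as np

def density_init(X, c):
    """Select c initial centers by data density, formulas (20) and (21).
    May return fewer than c centers on very small or tight data sets."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    md = D[np.triu_indices(n, k=1)].mean()       # formula (20)
    rho = np.exp(-(D / md) ** 2).sum(axis=1)     # formula (21)
    centers = []
    for idx in np.argsort(-rho):                 # densest points first
        if all(D[idx, j] >= md for j in centers):
            centers.append(idx)
        if len(centers) == c:
            break
    return X[centers]
```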
3.2. Simulation analysis
Normalization scales feature values by a fixed ratio so that they fall within a specified interval. Through normalization, all feature values can be uniformly mapped to dimensionless values in the interval [0,1] or [-1,1]. Normalization speeds up model generation, reduces the computation time required, and weakens the influence of outliers, improving accuracy and thus achieving the best clustering effect. If the feature value to be processed is a, its normalized value a′ is computed by the min-max method shown in (22):
$${a}^{{\prime }}=\frac{a-{a}_{min}}{{a}_{max}-{a}_{min}}$$
22
Alternatively, z-score standardization can be used, where μ is the mean of the feature and σ its standard deviation; the calculation formula is shown in (23):
$${\text{x}}_{i}^{{\prime }}=\frac{{\text{x}}_{i}-\mu }{\sigma }$$
23
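Both rescaling options, formulas (22) and (23), as per-feature NumPy operations over a toy matrix.

```python
import numpy as np

X = np.array([[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]])  # toy feature matrix

# Formula (22): min-max normalization to [0, 1], per feature.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Formula (23): z-score standardization, per feature.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)
```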
To evaluate the clustering quality, the entropy of each cluster i is calculated, where mi is the total number of samples in cluster i and mij is the number of samples of class j within it, as given in formula (24):
$${\text{Entropy}}_{i}=-\sum _{j=1}^{P}\frac{{m}_{ij}}{{m}_{i}}{\log }_{2}\frac{{m}_{ij}}{{m}_{i}}$$
24
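Formula (24) as code; the per-cluster class counts below are hypothetical, shaped like the rows of Tables 2 and 5.

```python
import numpy as np

def cluster_entropy(counts):
    """Entropy of one cluster, formula (24): `counts` holds the number of
    samples of each class j within cluster i (the m_ij)."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # skip empty classes (0*log 0 = 0)
    return -(p * np.log2(p)).sum()

# Hypothetical per-cluster class counts, one row per cluster.
for row in ([50, 0, 0], [0, 40, 10], [0, 5, 45]):
    print(round(cluster_entropy(row), 3))
```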
The initial cluster centers randomly selected by the existing FCM algorithm are shown in Table 1.
Table 1
Initial cluster centers randomly selected
Init_center 1 | (4.8, 2.4, 4.6, 1.8) |
Init_center 2 | (7.8, 3.9, 6.3, 2.1) |
Init_center 3 | (5.2, 3.6, 1.5, 0.3) |
Table 2 shows the clustering result obtained by the conventional FCM algorithm after partitioning the iris data set according to the randomly selected initial cluster centers.
Table 2
The clustering results of the traditional FCM algorithm on the iris data set
Class | Iris Setosa | Iris Versicolour | Iris Virginica |
1 | 60 | 0 | 0 |
2 | 0 | 37 | 1 |
3 | 0 | 23 | 59 |
Comparing the three clusters of this partition, it can be seen that the Iris Setosa cluster is partitioned very well: its entropy value is 0.0, so its clustering effect is the best. The entropy values of the other two clusters are 0.60 and 0.78, respectively, indicating that these two clusters are mixed with data from other classes and their clustering effect has not reached the ideal state. Figure 2 shows a 3D visualization of the clustering results of the existing FCM algorithm using three selected features: sepal length, petal length, and petal width.
First, the information gain ratio is used to calculate the importance of each feature.
Table 3
The importance of different characteristics
Feature name | sepal length | sepal width | petal length | petal width |
Importance | 0.09630315 | 0.0258686 | 0.41132545 | 0.4665028 |
From the data in Table 3, it can be concluded that the two most important features are petal width and petal length, while the importance values of sepal length and sepal width are as low as about 0.0963 and 0.0259, respectively. This demonstrates the differences in importance between features. Accordingly, the weighted Euclidean distance is used to measure the dissimilarity between data points during the clustering process.
Table 4
Use the density method to select the initial cluster centers
init_center1 | (6.1, 2.8, 4.2, 1.2) |
init_center2 | (7.2, 2.8, 6.2, 1.9) |
init_center3 | (5.3, 3.1, 4.6, 1.4) |
Table 4 shows the initial cluster centers selected using the density method.
Table 5
The clustering results of the improved FCM algorithm on the iris data set
Cluster | Iris Setosa | Iris Versicolour | Iris Virginica |
1 | 60 | 0 | 0 |
2 | 0 | 58 | 7 |
3 | 0 | 2 | 53 |
From the clustering results in Table 5, it can be concluded that the improved FCM algorithm not only clearly separates the Iris Setosa class, but also improves the classification of Iris Versicolour and Iris Virginica. To better evaluate the clustering effect, the entropy of each cluster is calculated as 0.0, 0.145, and 0.583, respectively. The entropy of cluster 1 is the same as that obtained by the existing FCM algorithm, while the other two values are lower than those of the existing FCM algorithm to some extent, indicating that most of the data of the same class are assigned to the same cluster. To confirm the usefulness of the improved FCM algorithm, the overall accuracies calculated from the data in Tables 2 and 5 are 89.23% and 94.5%, respectively. It can be seen that the improved FCM achieves a certain increase in accuracy.
Figure 3 shows a 3D visualization of the clustering results of the improved FCM algorithm on the iris data set. Compared with Fig. 2, it can be clearly seen that more iris samples have been correctly assigned to their classes.