The FP-growth algorithm compresses the data set into an FP-tree held in memory, then mines frequent itemsets by recursively constructing and mining conditional FP-trees [18]. However, when mining massive data with a low support threshold, the FP-tree generated by FP-growth occupies a large amount of memory, causing problems such as memory overflow and long running times [19–20]. This research addresses these problems through an optimized design of the parallel FP-growth algorithm: it proposes a load-balancing optimization scheme based on the parallel FP-growth algorithm and builds an experimental environment on the Hadoop framework. Experiments compare and analyze the efficiency of the traditional FP-growth algorithm, the parallel FP-growth algorithm, and the load-balancing parallel FP-growth algorithm under different data volumes and different minimum support thresholds.
3.1 Principle of the parallel FP-growth algorithm
Horizontal partitioning is used to overcome the performance bottleneck of the FP-growth algorithm on massive data [19]. The database is divided horizontally into several sub-databases, one per node. Each node computes frequency statistics for its sub-database, builds a local FP-tree, and derives the conditional pattern base of each frequent itemset; the conditional pattern bases are then gathered at their corresponding nodes, where a conditional FP-tree is built and local FP-growth mining is performed. The local FP-tree for the influencing factors of traffic crashes considered in this study is shown in Fig. 1.
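As a sketch of the per-node construction just described, the following Python fragment builds one sub-database's local FP-tree. It is illustrative only: the item names and min-support count in the usage below are hypothetical, and the actual system in this research runs this logic inside Hadoop jobs.

```python
from collections import defaultdict

class Node:
    """A node in the FP-tree: an item, its count, and child links."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_local_fp_tree(transactions, min_support):
    """Build a local FP-tree for one horizontal partition (sub-database)."""
    # 1. Frequency statistics for this sub-node.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= min_support}
    # 2. Insert each transaction, items sorted by descending frequency.
    root = Node(None)
    header = defaultdict(list)  # item -> nodes, used for conditional pattern bases
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header
```

The header table links all nodes of the same item, so each item's conditional pattern base can be collected by walking parent pointers from those nodes.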
3.2 Process design of parallel FP-growth algorithm
The operation of the parallel FP-growth algorithm can be summarized in the following five steps, introduced here in combination with this research case. The overall process design is shown in Fig. 2:
Step 1, Sharding: The Washington State freeway historical crash database is horizontally divided into six sub-databases along the dimensions of driver, vehicle, road, environment, time, and crash, and these are stored on several different computers. Each sub-database obtained after segmentation is defined as a slice.
Step 2, Parallel counting: Read each slice obtained in Step 1 and use a MapReduce job to count the frequent 1-itemset list FList. The Mapper function takes key-value pairs <key, value = Ti> as input and, after processing transaction Ti, outputs key-value pairs <key = aj, value = 1>, where aj is the j-th data item of Ti. After all slices have been processed by the Mapper function, key-value pairs with the same key are summed by the Reduce function. The keys output by the Reduce function are sorted by frequency from high to low, and FList is finally obtained by filtering with the minimum support.
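The counting job can be imitated in a few lines of Python (a sequential simulation for clarity; the real job distributes the Mapper over slices via Hadoop MapReduce, and the slice contents below are hypothetical):

```python
from collections import Counter

def mapper(transaction):
    """Emit <key=item, value=1> for every data item a_j in transaction T_i."""
    return [(item, 1) for item in transaction]

def build_flist(slices, min_support):
    """Simulate the parallel-counting job: map each slice, reduce by key,
    sort by descending frequency, then filter by the minimum support count."""
    counts = Counter()
    for slice_ in slices:               # each slice is mapped independently
        for t in slice_:
            for key, value in mapper(t):
                counts[key] += value    # Reduce: sum pairs with the same key
    return [(item, c) for item, c in counts.most_common() if c >= min_support]
```

For example, two slices `[["a","b"],["a"]]` and `[["a","c"],["b","c"]]` with a support count of 2 yield the FList `[("a", 3), ("b", 2), ("c", 2)]`.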
Step 3, Grouping of frequent 1-itemsets: All frequent items in FList are evenly divided into G groups, each with a unique group ID (Gid). The frequent items and their corresponding group numbers constitute the group list GList, which is distributed to all slices.
Step 4, Parallel FP-growth: This is the core step of the parallel algorithm. Local FP-growth mining is performed by a MapReduce job. The Mapper function reads each slice as key-value pairs <key, value = Ti>, scans every frequent item ai in transaction Ti, finds the corresponding group ID (Gid) in the group list GList, and outputs the transaction keyed by that Gid. The Reduce function corresponding to each key reads the Mapper output, builds a local FP-tree, and recursively performs local FP-growth mining for the frequent items of each Gid. The frequent itemsets obtained from mining are kept in a heap ordered from high to low support, and the top K itemsets with the highest support are returned.
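The Mapper logic of Step 4 can be sketched as follows. This is a simplified, single-process Python analogue: `flist_rank` and `glist` are assumed lookup tables built in Steps 2 and 3, and emitting each group once per transaction with its longest prefix follows the standard PFP formulation, which the paper's description leaves implicit.

```python
def group_mapper(transaction, flist_rank, glist):
    """Emit <key=Gid, value=group-dependent prefix> for one transaction T_i.
    flist_rank: item -> position in FList (0 = most frequent);
    glist: item -> group id (Gid)."""
    # Keep only frequent items, ordered as in FList.
    items = sorted((a for a in transaction if a in flist_rank),
                   key=lambda a: flist_rank[a])
    emitted, out = set(), []
    # Scan from the least frequent item towards the most frequent one.
    for j in range(len(items) - 1, -1, -1):
        gid = glist[items[j]]
        if gid not in emitted:      # each group gets at most one record per T_i
            emitted.add(gid)
            out.append((gid, items[:j + 1]))
    return out
```

The Reduce side then builds a local FP-tree from the prefixes of its Gid, exactly as in the serial algorithm.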
Step 5, Aggregation: Read the output of Step 4, combine identical key-value pairs via the Mapper function, and then use the Reduce function to compute and output the global top-K frequent itemsets with the highest support and confidence.
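The global top-K selection of Step 5 can be sketched with a heap. This is an illustrative Python fragment: `partial_results` stands in for the per-group mining output of Step 4, and the de-duplication step is defensive (in PFP each itemset is normally mined in exactly one group).

```python
import heapq
from collections import defaultdict

def aggregate_top_k(partial_results, k):
    """Merge per-group (itemset, support) pairs and keep the global top-K."""
    merged = defaultdict(int)
    for itemset, support in partial_results:
        key = frozenset(itemset)
        merged[key] = max(merged[key], support)   # de-duplicate across groups
    # nlargest returns the k pairs with the highest support, in descending order.
    return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])
```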
3.3 Implementation of optimized parallel FP-growth algorithm with load balancing constraints
3.3.1 Definition of load balancing and grouping strategy
Load balancing refers to spreading work tasks across multiple processing units so that they are completed cooperatively. In this research it means the balanced grouping of frequent 1-itemsets so that the running load of each slice is roughly equal, thereby improving the efficiency of data mining. The core strategy consists of two steps. The first is to build a load model that estimates, for each frequent item, the total workload of recursively mining the FP-tree on its projected database. The second is to divide all frequent items into G groups by a balanced grouping strategy, and correspondingly divide the transactions into G groups, so that the load of each node is balanced. Each group's tasks are completed by its corresponding node, which recursively mines the FP-tree for its frequent items, improving the processing capability of the system.
(1) Establishment of load model
For the FP-growth algorithm it is not practical to compute the workload of each frequent item exactly; it can only be estimated reasonably. The workload of recursively mining the conditional FP-tree of a frequent item equals the number of times that conditional FP-tree recursively calls FP-growth. The length of the longest frequent path of each frequent item in its conditional FP-tree corresponds to the item's position in FList.
In the conditional FP-tree, the number of times a frequent item recursively calls FP-growth grows exponentially with the length of its longest frequent path [18–20], which yields the load model in formula (1), where Fi is the estimated load of each node and P(i, FList) denotes the position of frequent item i in the frequent 1-itemset FList:
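Formula (1) can be written as follows. This is a reconstruction consistent with the exponential relationship and the position function P(i, FList) just stated; the base of the exponent is an assumption and may differ in the original:

```latex
F_i = 2^{\,P(i,\ \mathrm{FList})}
```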
(2) Balanced grouping
According to formula (1), in the frequent 1-itemset FList, the lower the support count of a frequent item, the farther back its position and the larger its load value; the load values of the items in FList therefore form an increasing sequence. Since a globally optimal grouping is difficult to achieve, this research groups the frequent items with a greedy strategy. Assuming the number of groups is G, FList is traversed in order from back to front. First, the last G frequent items are placed into the G groups; each remaining item is then placed, in turn, into the group with the smallest total load, and its load value is added to that group's total. This repeats until all frequent items in FList are allocated.
3.3.2 Implementation of load balancing algorithm
The process of the improved load-balancing parallel FP-growth algorithm is the same as that of the parallel FP-growth algorithm above; only the grouping of frequent 1-itemsets in Step 3 is improved. The pseudo-code and flow chart of the algorithm are shown in Table 2 and Fig. 3, respectively.
Table 2
Code implementation of load balancing algorithm
Load balancing algorithm pseudocode:

generateGList function:
void generateGList(List FList, int G) {
    HashTable GList;
    MinHeap minHP;                              // groups ordered by current total load
    n ← FList.length();
    if n < G then                               // fewer items than groups: one group each
        Gid ← 0;
        foreach item aj in FList do
            GList.put(aj, Gid);
            Gid ← Gid + 1;
        end
    else
        // Seed the G groups with the last G items of FList (largest load estimates)
        for j = n - 1 downto n - G do
            item[j].weight ← F(item[j]);        // load estimation, formula (1)
            item[j].group ← (n - 1) - j;
            GList.put(item[j], item[j].group);
            minHP.add(item[j]);
        end
        // Greedily place each remaining item into the currently lightest group
        for j = n - G - 1 downto 0 do
            minHP[0].weight ← minHP[0].weight + F(item[j]);
            GList.put(item[j], minHP[0].group);
            minHP.adjust();
        end
    end
}
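The grouping routine can also be written as a runnable Python sketch. The load function `2 ** pos` is an assumed instance of the exponential load model, and the group ids and tie-breaking are illustrative:

```python
import heapq

def generate_glist(flist, g):
    """Balanced grouping of the frequent 1-itemset list FList into g groups.
    flist is ordered by descending support, so load estimates grow toward the back."""
    def load(pos):
        return 2 ** pos          # assumed instance of the exponential load model

    glist = {}
    n = len(flist)
    if n < g:                    # fewer items than groups: one group per item
        for gid, item in enumerate(flist):
            glist[item] = gid
        return glist
    heap = []                    # heap entries: (total group load, gid)
    # Seed the g groups with the last g items (the largest load estimates).
    for gid in range(g):
        pos = n - 1 - gid
        glist[flist[pos]] = gid
        heapq.heappush(heap, (load(pos), gid))
    # Walk the rest of FList back to front, always filling the lightest group.
    for pos in range(n - 1 - g, -1, -1):
        total, gid = heapq.heappop(heap)
        glist[flist[pos]] = gid
        heapq.heappush(heap, (total + load(pos), gid))
    return glist
```

For example, grouping five items into two groups assigns the heaviest item to one group by itself and the remaining four to the other, giving total loads of 16 and 15 under the assumed load function.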
3.4 Construction of experimental environment
In this research, the volume of historical freeway crash data is large, so the fully distributed mode is selected in the experiment to build a Hadoop distributed processing system. A cluster is built from three computers running Linux, with three nodes in total: one master node and two slave nodes. The hardware and software configuration of each computer is shown in Table 3.
Table 3
Virtual machine configuration
Hardware configuration:
- CPU: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.2 GHz
- Memory: 32 GB
- Hard disk: 360 GB
- GPU: AMD Radeon R7 350

Software configuration:
- Linux: Ubuntu 18.04.5
- Java version: JDK-8u161-linux-x64
- Hadoop version: Hadoop-2.7.4
- Integrated development environment: Eclipse
The process of building a Hadoop framework cluster is as follows:
Step 1: Install Linux Ubuntu 18.04.5 on the three computers, and set their host names and network configuration to "master", "slave1", and "slave2".
Step 2: Modify the hosts file on each computer and generate public/private key pairs to enable passwordless SSH login between the computers.
Step 3: Upload and install the JDK and configure its environment variables on each computer.
Step 4: Upload and install Hadoop, edit the environment variables and the configuration files in Hadoop's conf directory, and format HDFS.
Step 5: Start the Hadoop cluster and run the jps command to view all processes. The started Hadoop cluster processes are shown in Fig. 4.
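The steps above correspond roughly to the following commands. This is a hedged sketch: installation paths, key options, and the order of per-node execution are assumptions, and each command is run on the appropriate node.

```shell
# Step 2: passwordless SSH between master and slaves (run on master)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id slave1 && ssh-copy-id slave2
# Step 3: JDK environment variables (install path is an assumption)
echo 'export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_161' >> ~/.bashrc
# Step 4: Hadoop environment variables and HDFS formatting
echo 'export HADOOP_HOME=/usr/local/hadoop-2.7.4' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
hdfs namenode -format
# Step 5: start the cluster and verify the daemons on each node
start-dfs.sh && start-yarn.sh
jps   # NameNode/ResourceManager on master; DataNode/NodeManager on slaves
```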
3.5 Comparison of improved algorithm efficiency
In order to verify the rationality of the algorithm designed in this research, the serial FP-growth algorithm, the parallel FP-growth algorithm, and the parallel FP-growth algorithm with load-balancing constraints are each run on the Hadoop cluster. The computational efficiency of the algorithms is compared and analyzed from the perspectives of data volume and minimum support.
(1) Sensitivity analysis for data volume on algorithm efficiency
Records are randomly selected from the data set to form the experimental data sets; the four data sets used in the experiment contain 100,000, 500,000, 1,000,000 and 1,500,000 records, respectively. The results of running the three algorithms with a minimum support of 0.5 are shown in Fig. 5.
Figure 5 indicates that when the experimental data set is small, the running times of the three algorithms differ very little. A possible explanation is that the two parallel FP-growth algorithms need extra time to start the cluster, which offsets the advantage of parallelism. As the data set grows, the running times of the two parallel FP-growth algorithms fall well below that of the serial algorithm, and their performance advantage becomes more obvious with larger record counts. Compared with the parallel FP-growth algorithm without load balancing, the load-balancing parallel FP-growth algorithm based on the balanced grouping method further reduces running time and is better suited to mining large data sets. The results in Fig. 5 show that the optimized load-balancing parallel FP-growth algorithm not only avoids the large memory consumption and long running times of the traditional serial algorithm on big data, but also further shortens mining time and improves algorithm efficiency.
(2) Sensitivity analysis for the minimum support on algorithm efficiency
According to the previous analysis of association rules and algorithms in this research, the minimum support setting has a great impact on running efficiency: the smaller the minimum support, the more frequent items there are and the longer the algorithm runs. In this experiment, the 500,000-record data set from experiment (1) is used, and the three algorithms mine it with the minimum support set between 0.1 and 0.7. The running-time comparison is shown in Fig. 6.
Figure 6 indicates that when the minimum support is high, such as 0.6 or 0.7, the number of frequent items in the data set is small and all three algorithms run quickly. As the support decreases, the number of frequent items increases and the running times begin to rise. Below a certain minimum support, the serial FP-growth algorithm suffers memory heap overflow; it automatically reports an error and stops running. The parallel FP-growth algorithms overcome this shortcoming and can still run at lower support thresholds, with running times below that of the serial algorithm. In terms of overall running time, the optimized parallel FP-growth algorithm is more efficient than the original one, and its advantage becomes more prominent as the support threshold decreases.
The above experimental results show that the optimized Hadoop-based parallel FP-growth algorithm proposed in this research performs efficient data mining even at low support. Its operating efficiency is clearly better than that of the serial algorithm and the original parallel FP-growth algorithm, and it is well suited to the subsequent mining of freeway traffic crash data in this research.