An Algorithm to Create Sorted FP-Growth Tree for Extracting Association Rules

doi:10.21203/rs.3.rs-2076910/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Mining massive databases for useful data, frequent patterns (FPs) and knowledge has been widely studied in data mining research, with applications of interest to governments, industry and sales companies. In data mining, representing frequent patterns as a growth tree is an efficient way to discover knowledge and compress information into a tree structure. Previous studies have presented various methods for obtaining frequent patterns, but these can require complex and costly processing, particularly when the number of patterns is large. In this paper we present a divide-and-conquer algorithm based on the FP-Growth Tree that builds a tree whose nodes are sorted from the outset, so that the most frequent patterns are available at every moment of the tree construction procedure. In addition, because the tree is sorted from the beginning of its formation, a parameter can be used to avoid inserting less frequent branches.

data mining

big data

FP-Growth Tree

association rules

Association rule mining (ARM) has received significant attention in data mining research. ARM is useful in various applications, such as cross-marketing, market basket analysis, fraud detection and filtering. FP-Tree is an efficient and scalable approach for computing the frequent patterns that ARM requires. It uses a prefix-tree structure to compress and represent FPs. Han et al. [1] showed that the FP-Growth Tree method performs better than earlier methods such as Apriori [2] or TreeProjection [3]. Other studies [4][5][6] reported that FP-Growth also runs more efficiently than Eclat [7] and Relim [8]. Because of this performance, FP-Growth has become popular and many studies have sought to optimize it. The FP-Growth algorithm mines FPs without candidate generation and uses divide-and-conquer to obtain association rules. In general, the algorithm compresses the database into an FP-Tree structure that represents the FPs.

Zaki [9] developed the equivalence class transformation (ECLAT) algorithm for frequent itemset mining. It relies on the intersection property and supports both sequential and parallel computation. Unlike Apriori and FP-growth, ECLAT uses a vertical data layout and creates a list of transaction identifiers (TID list) for each item.
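To make the vertical layout concrete, the following minimal sketch builds one TID list per item and obtains the support of an itemset by intersecting those lists. The toy transactions and the helper names `build_tidlists` and `support` are illustrative assumptions and are not taken from [9].

```python
from collections import defaultdict

def build_tidlists(transactions):
    """Map each item to the set of transaction IDs (TIDs) that contain it."""
    tidlists = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)

def support(itemset, tidlists):
    """Support of an itemset = size of the intersection of its items' TID lists."""
    tid_sets = [tidlists.get(item, set()) for item in itemset]
    if not tid_sets:
        return 0
    return len(set.intersection(*tid_sets))

# Hypothetical toy database (not from the source): TID -> items.
toy_db = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}
tidlists = build_tidlists(toy_db)
print(support(["a", "b"], tidlists))  # transactions 1 and 3 contain both items
```

The horizontal database is read only once to build the TID lists; every later support count is a set intersection.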

The hyper-structure mining of frequent patterns (H-Mine) algorithm [10] was proposed by Pei et al. to overcome the performance bottlenecks of FP-growth by using queues rather than a tree data structure. H-Mine organizes transaction items into distinct queues and uses hyperlinks to connect transactions that have the same first item. It is considered well suited to sparse datasets and more efficient in memory and runtime than Apriori and FP-growth.

Li et al. [11] introduced a parallel algorithm that uses three MapReduce phases to extract the tasks of FP-Growth and incorporate the intermediate data.

Grahne and Zhu [12] introduced a novel array-based technique that allows FP-trees to be used more efficiently when mining frequent itemsets. The technique greatly reduces the time spent traversing FP-trees and works especially well for sparse datasets. Furthermore, they presented algorithms for mining maximal and closed frequent itemsets. Their FP-growth* algorithm, which extends the original FP-growth method, also uses the array technique to mine all frequent itemsets. They also proposed another FP-array approach [13] to minimize tree traversal time, but the recurring construction of conditional FP-trees remains.

Shawkat et al. [14] presented a new scheme to discover associations from a wide range of relations across the dataset. By combining the FP-tree* mining method with the header table of FP-growth, they developed an algorithm called MFP-growth.

We propose an algorithm called the Sorted Frequent Pattern Growth Tree (SFP-Growth Tree), which is based on the basic FP-Tree theory and designed to address two issues. First, in the basic FP-Tree the resulting tree is not sorted when construction completes, so finding the most frequent patterns requires either sorting the tree or paying the cost of a search every time, neither of which is advantageous. To solve this, the proposed algorithm creates a tree that is sorted from the initial levels down to the leaf nodes. Second, users may not want less frequent patterns, yet in the basic FP-Growth Tree all transactions have to be inserted. Because the proposed algorithm keeps the tree sorted from the start, the undesirable less frequent patterns can be omitted, and the result is a pruned tree containing the most frequent patterns according to the user's setting. In this method, a node and its metadata are inserted into the tree once and need not be updated until the end of the operation. To achieve this, the way transactions are read must be changed so that the tree grows horizontally, level by level. Thus, instead of traversing the tree irregularly while inserting nodes, level a must be filled before level a+1 is processed. Figure 1 illustrates the direction of the node insertion sequence.

Table 1 shows a transaction database used as a running example. Its last column lists the frequent items of each transaction after the first scan [1] (the preparation stage: sorting and applying min_support).

Table 1

A transaction database used as the running example

TID	Items in Transaction	(ordered) Frequent Items
100	f, a, c, d, g, i, m, p	f, c, a, m, p
200	a, b, c, f, l, m, o	f, c, a, b, m
300	b, f, h, j, o	f, b
400	b, c, k, s, p	c, b, p
500	a, f, c, e, l, p, m, n	f, c, a, m, p
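The first scan behind the last column of Table 1 can be sketched as follows. This is an illustration under stated assumptions: the source does not give the min_support value or the tie-breaking rule, so `MIN_SUPPORT = 3` and `TIE_ORDER = "fcabmp"` are assumed values chosen because they reproduce the table, and `first_scan` is a hypothetical helper name.

```python
from collections import Counter

# Raw transactions from Table 1 (TID -> items).
RAW_DB = {
    100: ["f", "a", "c", "d", "g", "i", "m", "p"],
    200: ["a", "b", "c", "f", "l", "m", "o"],
    300: ["b", "f", "h", "j", "o"],
    400: ["b", "c", "k", "s", "p"],
    500: ["a", "f", "c", "e", "l", "p", "m", "n"],
}

MIN_SUPPORT = 3       # assumption: not stated in the source
TIE_ORDER = "fcabmp"  # assumption: order among items with equal counts

def first_scan(db, min_support, tie_order):
    """Count item supports, drop infrequent items, reorder each transaction."""
    counts = Counter(item for items in db.values() for item in set(items))
    frequent = {item: n for item, n in counts.items() if n >= min_support}
    tie_rank = {item: pos for pos, item in enumerate(tie_order)}

    def sort_key(item):
        # Higher support first; ties resolved by the assumed tie order.
        return (-frequent[item], tie_rank.get(item, len(tie_rank)))

    ordered = {
        tid: sorted((i for i in items if i in frequent), key=sort_key)
        for tid, items in db.items()
    }
    return frequent, ordered

frequent, ordered = first_scan(RAW_DB, MIN_SUPPORT, TIE_ORDER)
for tid, items in ordered.items():
    print(tid, items)
```

With these assumed settings the frequent items are f and c (support 4) and a, b, m and p (support 3), and each reordered transaction matches the corresponding row of Table 1.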

With the proposed algorithm, the data are read in order of frequency. The first entry of each transaction is the most frequent item in that transaction, and these first entries form level 1 of the tree. In general, the n-th entry of each transaction forms level n. In the example, the items f and c, with frequencies 4 and 1 respectively, are placed at level 1.
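As a small check of the level-1 step, the sketch below counts the first entry of every ordered transaction from Table 1 and sorts the result by descending frequency. The function name `level_one` and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter

# Ordered frequent items from the last column of Table 1.
ORDERED_DB = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def level_one(transactions):
    """Count first entries and return (item, count) pairs, most frequent first."""
    counts = Counter(t[0] for t in transactions if t)
    # Assumed tie-break: alphabetical order among equal counts.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(level_one(ORDERED_DB))  # [('f', 4), ('c', 1)]
```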

We propose that nodes be inserted in sorted order at every step. In the next step, the transactions with prefix f are read, the frequency of each child item is calculated, and all the children are then created below the parent node at once, in sorted order.

This operation is repeated for every node on the same level.

The process is repeated for every prefix in the tree until the whole tree has been constructed in sorted order. The resulting SFP-Growth tree for the example is shown in Figure 4.
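The construction steps described above can be sketched as a breadth-first build. This is a minimal illustration rather than the authors' implementation: the `Node` class, `build_sfp_tree`, `to_nested` and the alphabetical tie-break are assumptions, and transaction suffixes are copied for clarity rather than efficiency.

```python
from collections import Counter, deque
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    item: Optional[str]
    count: int
    children: List["Node"] = field(default_factory=list)

def build_sfp_tree(transactions):
    """Build the tree level by level; children are created once, already sorted."""
    root = Node(item=None, count=len(transactions))
    # Each queue entry pairs a node with the suffixes of the transactions reaching it.
    queue = deque([(root, [t for t in transactions if t])])
    while queue:
        parent, suffixes = queue.popleft()  # FIFO order fills level a before a+1
        counts = Counter(s[0] for s in suffixes)
        # Assumed tie-break: alphabetical order among equal counts.
        for item, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            child = Node(item, count)       # final count known at insertion time
            parent.children.append(child)
            rest = [s[1:] for s in suffixes if s[0] == item and len(s) > 1]
            queue.append((child, rest))
    return root

def to_nested(node):
    """Render the tree as nested (item, count, children) tuples for inspection."""
    return [(c.item, c.count, to_nested(c)) for c in node.children]

ORDERED_DB = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
tree = build_sfp_tree(ORDERED_DB)
print(to_nested(tree))
```

For the running example this yields f:4 and c:1 at level 1, with c:3 and b:1 below f, so the children of every node appear in descending order of frequency without any later update or re-sorting.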

Compared with the tree produced by the basic algorithm, the branches, paths and patterns are similar; however, the SFP-Growth Tree is sorted.

In very large databases, the process can be aborted at any time, and the tree will already contain the top-most sorted nodes inserted up to that moment.
During tree construction, each node is inserted once and never needs to be updated. A node is only read again to find its children.
Given the construction stages of the SFP-Growth Tree, we can apply a parameter that expresses the importance of frequent patterns by limiting how many children are accepted for each node, so that only the n most frequent paths are kept. Moreover, the algorithm avoids computing the paths beyond the defined value, which saves computational resources.
Because the nodes are inserted into the tree in sorted order, there is no need to rebuild or sort the resulting tree.
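The first two points can be illustrated with a generator that emits the tree one completed level at a time, so that stopping early still leaves fully sorted upper levels. This is a sketch under assumptions: `sfp_levels` is a hypothetical name, nodes are represented by their path from the root, and ties are broken alphabetically.

```python
from collections import Counter
from itertools import islice

def sfp_levels(transactions):
    """Yield each level as a list of (path, count), most frequent children first."""
    frontier = [((), [t for t in transactions if t])]
    while frontier:
        level, next_frontier = [], []
        for path, suffixes in frontier:
            counts = Counter(s[0] for s in suffixes)
            # Assumed tie-break: alphabetical order among equal counts.
            for item, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
                child_path = path + (item,)
                level.append((child_path, count))  # inserted once, never updated
                rest = [s[1:] for s in suffixes if s[0] == item and len(s) > 1]
                if rest:
                    next_frontier.append((child_path, rest))
        if level:
            yield level
        frontier = next_frontier

ORDERED_DB = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

# Abort after two levels: the partial result is already final for those levels.
first_two = list(islice(sfp_levels(ORDERED_DB), 2))
print(first_two)
```

Because the generator is lazy, the deeper levels are never computed when the caller stops after the second one.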

In the previous example, if the most-frequent parameter is set to 2, each node keeps at most its two most frequent children in the resulting tree.
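A hedged sketch of this pruning follows, assuming the parameter caps the number of children kept under each node. The names `max_children`, `build_pruned_sfp_tree` and `leaf_paths` are hypothetical, and ties are broken alphabetically.

```python
from collections import Counter, deque

def build_pruned_sfp_tree(transactions, max_children):
    """Level-by-level build that keeps only the top max_children under each node."""
    root = {"item": None, "count": len(transactions), "children": []}
    queue = deque([(root, [t for t in transactions if t])])
    while queue:
        parent, suffixes = queue.popleft()
        counts = Counter(s[0] for s in suffixes)
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        # Children beyond the cap are skipped, so their subtrees are never computed.
        for item, count in ranked[:max_children]:
            child = {"item": item, "count": count, "children": []}
            parent["children"].append(child)
            rest = [s[1:] for s in suffixes if s[0] == item and len(s) > 1]
            queue.append((child, rest))
    return root

def leaf_paths(node, prefix=()):
    """List each root-to-leaf path together with the count stored at its leaf."""
    out = []
    for child in node["children"]:
        path = prefix + (child["item"],)
        if child["children"]:
            out.extend(leaf_paths(child, path))
        else:
            out.append((path, child["count"]))
    return out

ORDERED_DB = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
print(leaf_paths(build_pruned_sfp_tree(ORDERED_DB, max_children=2)))
print(leaf_paths(build_pruned_sfp_tree(ORDERED_DB, max_children=1)))
```

Under this reading, a cap of 2 retains every node of the running example, because no node there has more than two children, whereas a cap of 1 keeps only the single path f, c, a, m, p. If the parameter is meant differently in the source, for example as a global number of paths, the result would differ.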

The min_support threshold is a suitable criterion for mining frequent patterns, but it is not sufficient, because it considers only the frequency of a single item. In addition, using other criteria such as min_conf would be complex and costly. In large databases, moreover, many of the mined patterns turn out to be useless. The proposed method can help to disregard those patterns. Furthermore, large data sorted in a compressed structure can facilitate subsequent processing.
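For reference, the two thresholds mentioned above rest on the standard definitions of support and confidence, which the following sketch evaluates on the ordered transactions of Table 1. The function names are illustrative assumptions, and the rule f ⇒ c is chosen only as an example.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

def confidence(antecedent, consequent, transactions):
    """conf(X => Y) = support(X union Y) / support(X); 0.0 if X never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base

ORDERED_DB = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

print(support(["f"], ORDERED_DB))            # 4
print(confidence(["f"], ["c"], ORDERED_DB))  # 0.75
```

Each confidence value needs two support counts over the database, which gives a sense of why evaluating min_conf for many candidate rules adds cost.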

Ethical Approval

We further confirm that any aspect of the work covered in this manuscript that has involved human patients has been conducted with the ethical approval of all relevant bodies and that such approvals are acknowledged within the manuscript.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. We wish to draw the attention of the Editor to the following facts which may be considered as potential conflicts of interest and to significant financial contributions to this work.

Authors' contributions

We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

Funding

No funding was received for this work.

Availability of data and materials

We confirm that no special dataset was used in this paper.

Intellectual Property

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.

We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author.

References

[1] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In: Proc. Conf. on the Management of Data (SIGMOD’00, Dallas, TX). ACM Press, New York, NY, USA 2000.
[2] Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), Santiago, Chile, pp. 487–499.
[3] Agarwal, R., Aggarwal, C., and Prasad, V.V.V. 2001. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 61:350–371.
[4] B. Santhosh Kumar and K.V. Rukmani. Implementation of Web Usage Mining Using APRIORI and FP Growth Algorithms. Int. J. of Advanced Networking and Applications, Volume: 01, Issue: 06, Pages: 400-404 (2010).
[5] Cornelia Gyorödi and Robert Gyorödi. A Comparative Study of Association Rules Mining Algorithms. SACI 2004, 1st Romanian-Hungarian Joint Symposium on Applied Computational Intelligence.
[6] F. Bonchi and B. Goethals. FP-Bonsai: The Art of Growing and Pruning Small FP-trees. Proc. 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’04, Sydney, Australia), 155–160. Springer-Verlag, Heidelberg, Germany 2004.
[7] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for Fast Discovery of Association Rules. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD’97), 283–296. AAAI Press, Menlo Park, CA, USA 1997.
[8] Christian Borgelt. Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination. Workshop Open Source Data Mining Software (OSDM'05, Chicago, IL), 66-70. ACM Press, New York, NY, USA 2005.
[9] Zaki MJ (1997) Fast mining of sequential patterns in very large databases. University of Rochester Computer Science Department, New York.
[10] Pei J, Han J, Lu H, Nishio S, Tang S, Yang D (2001) H-mine: hyper-structure mining of frequent patterns in large databases. In: Proc. IEEE Int. Conf. on Data Mining, IEEE, pp. 441–448.
[11] Li H, Wang Y, Zhang D, Zhang M, Chang EY (2009) PFP: parallel FP-growth for query recommendation. In: ACM Conference on Recommender Systems, pp. 107–114.
[12] Grahne G. and Zhu J. Efficiently Using Prefix-trees in Mining Frequent Itemsets. In Proc. of the IEEE ICDM Workshop on Frequent Itemset Mining, 2004.
[13] Grahne G, Zhu J (2005) Fast algorithms for frequent itemset mining using FP-trees. IEEE Trans Knowl Data Eng 17(10):1347–1362. https://doi.org/10.1109/TKDE.2005.166
[14] Shawkat, M., Badawi, M., El-ghamrawy, S. et al. An optimized FP-growth algorithm for discovery of association rules. J Supercomput 78, 5479–5506 (2022). https://doi.org/10.1007/s11227-021-04066-y
