Association Rule Mining for Big Data Using RDD-Eclat Algorithms

Abstract: The revolution in technologies for storing and processing big data has led to data-intensive computing as a new paradigm. To extract valuable and precise knowledge from big data, efficient and scalable data mining techniques are required. In data mining, different techniques are applied depending on the kind of knowledge to be mined. Association rules are generated from the frequent itemsets computed by frequent itemset mining (FIM) algorithms. This paper addresses the problem of designing scalable and efficient frequent itemset mining algorithms on the Spark RDD framework. It aims to improve the performance (in terms of execution time) of existing Spark-based frequent itemset mining algorithms and to efficiently re-design other frequent itemset mining algorithms on Spark. The particular problem of interest is re-designing the Eclat algorithm in the distributed computing environment of Spark. The paper proposes and implements a parallel Eclat algorithm using the Spark RDD framework, dubbed RDD-Eclat. EclatV1 is the earliest version, followed by EclatV2, EclatV3, EclatV4, and EclatV5. Each version results from a different technique and heuristic applied to the preceding variant. Following EclatV1, the filtered transaction technique is applied, followed by heuristics for equivalence class partitioning in EclatV4 and EclatV5. EclatV2 and EclatV3 differ only slightly algorithmically, as do EclatV4 and EclatV5. Experiments on synthetic and real-world datasets show that the proposed algorithms outperform the Spark-based Apriori algorithm.


INTRODUCTION
Data is generated continuously, either by machines themselves or by humans using machines. Such petabyte-scale data may be produced by business transactions, scientific simulation or experimentation, social networking sites, web content, sensor devices, etc. Data is the new oil, used as the fuel for creating big business value [towardsdatascience.com]. But without appropriate algorithms, that business value or insight cannot be harvested from the data. The development of technologies for storing and processing large amounts of data has ushered in a new paradigm of data-intensive computing. To extract meaningful and precise knowledge from large amounts of data, efficient and scalable data mining approaches are required. Data mining is the process of identifying and analysing hidden and interesting patterns within a massive amount of data. Various techniques are used in data mining, depending on the type of knowledge to be mined. The data mining technique association rule mining (ARM) is used to uncover relevant relationships between data objects in a database. Association rules are derived from frequent itemsets computed using frequent itemset mining (FIM) techniques. Apriori, Eclat, and FP-Growth are the three fundamental algorithms for mining frequent itemsets, and numerous variants and extensions of them have been developed. These are sequential algorithms that run on a single machine with limited processing and memory resources. Thus, these algorithms have been parallelized and distributed to execute on parallel and distributed computing systems. However, such parallel and distributed algorithms are incapable of efficiently handling and processing large amounts of data, as typical distributed systems focus on data exchange, which demands a high level of communication and network capacity; additionally, they lack fault tolerance and a high-level parallel programming model. Hadoop [http://hadoop.apache.org] is a distributed system that relocates computation to the location of the data rather than relocating the data, and it provides fault-tolerant distributed batch processing for massive amounts of data.
The following are the section's key contributions:
1. The parallel Eclat algorithm is proposed as RDD-Eclat using the Spark RDD framework.
2. Five variants of the algorithm, EclatV1 to EclatV5, are proposed. These variants are based on different strategies applied in the expectation that better performance will be achieved.
3. Transaction filtering is adopted in some variants to observe its effect on space and time complexity, and heuristics for partitioning equivalence classes are applied in some variants to achieve partitions with a balanced workload.
4. Extensive experiments are carried out to compare the performance of all proposed algorithms with Spark-based Apriori. Additionally, the proposed algorithms are compared with each other in terms of performance and scalability.

RELATED WORK
Data mining is formally known as KDD (Knowledge Discovery in Databases) [Han et al. (2006)]. KDD comprises three basic steps: pre-processing, data mining, and post-processing. The data mining step is the core of KDD and has several algorithms depending on the type of knowledge to be extracted. The pre-processing step consists of data cleaning, data integration, data selection, and data transformation. The post-processing step deals with pattern evaluation and knowledge representation [Han et al. (2006)], [Jen et al. (2016)]. In some literature, data mining and KDD are used interchangeably. However, data mining is the essential step that discovers or extracts hidden and interesting patterns from a huge volume of data [Han et al. (2006)]. Data mining methods are classified into two broad categories, predictive and descriptive.

Association Rule Generation
In order to generate association rules, computing and checking the support and confidence of every potential rule is not an efficient approach. The confidence of a rule is defined in terms of the support of the itemsets that form the rule [Jen et al. (2016)]. So, association rule generation is divided into two steps: the first step produces the frequent itemsets along with their support, and the second step produces strong rules from these frequent itemsets.
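In symbols: for a rule X => Y over a transaction database D, the standard definitions (not taken verbatim from the original, but consistent with it) are

  \mathrm{supp}(X \Rightarrow Y) \;=\; \frac{|\{\, t \in D : X \cup Y \subseteq t \,\}|}{|D|},
  \qquad
  \mathrm{conf}(X \Rightarrow Y) \;=\; \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}.

A rule is strong if its support and confidence meet the user-specified thresholds min_sup and min_conf. Since every confidence value is a ratio of two supports already obtained in the first step, the second step requires no further scans of the database.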
There are three fundamental approaches to frequent itemset mining, based respectively on Apriori, equivalence classes and cliques, and the FP-tree. There is a long list of FIM algorithms, and most of them are either extensions or improved versions of these three. Apriori is the seminal algorithm for mining frequent itemsets efficiently, based on candidate generation using the Apriori property; a candidate, or candidate itemset, is a potentially frequent itemset not yet checked against the minimum support. The algorithms Eclat and Clique exploit equivalence class, clique, and lattice properties to generate frequent itemsets. The FP-Growth algorithm mines frequent itemsets faster, without generating candidate itemsets. This work explores the first two of these algorithms, Apriori and Eclat, in Spark's distributed environment.
One of the most useful features of Spark is RDD persistence, i.e., persisting or caching data in memory across operations. It is a key feature for iterative algorithms: computed data can be persisted as an RDD and reused again and again without re-computation, which makes later operations much faster. There are two methods to persist an RDD, persist() and cache(). Persistence is lazy: the RDD is materialised and kept in memory across the nodes of the cluster when the first action on it is executed. The persist() method takes the storage level as a parameter and persists the RDD at that level; the available storage levels include memory only, disk only, and memory and disk. The cache() method caches the RDD at the default storage level, memory only [RDD Programming Guide (2018)].
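As an illustration, here is a minimal Scala sketch of both methods (the input file name and variable names are ours, purely hypothetical):

  import org.apache.spark.SparkContext
  import org.apache.spark.storage.StorageLevel

  object PersistenceExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local[*]", "PersistenceExample")

      // Hypothetical input file: one transaction per line.
      val transactions = sc.textFile("transactions.txt")

      // persist() takes the desired storage level as a parameter.
      val items = transactions.flatMap(_.split(" ")).persist(StorageLevel.MEMORY_AND_DISK)

      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
      transactions.cache()

      // The first action materialises the RDD and caches its partitions ...
      println(items.count())
      // ... so subsequent actions reuse them instead of re-computing from the file.
      println(items.distinct().count())

      sc.stop()
    }
  }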

RDD-ECLAT ALGORITHMS
We implemented the Eclat algorithm in parallel using the Spark RDD framework and dubbed it RDD-Eclat. By applying different techniques and heuristics, we propose five variants of RDD-Eclat.
EclatV1 is the initial implementation, while its successors EclatV2, EclatV3, EclatV4, and EclatV5 result from modifications to the preceding variant. Each of the proposed algorithms is divided into three to four phases. A phase here does not refer to a MapReduce phase but to a logical step of the algorithm.

EclatV1
A paired RDD is an RDD that contains pairs of (key, value). The groupByKey() transformation groups all pairs that share a common key, so each item becomes associated with its tidset, the set of ids of the transactions containing it. The support count of an item is equal to the size of its tidset. The filter() transformation eliminates items with a support count less than min_sup and creates a paired RDD, freqItemTids, that contains only the frequent items and their associated tidsets. The paired RDD freqItemCounts holds pairs of (item, count), where count is the item's support count; here (itemTid._1, itemTid._2) is the (key, value) pair of a Tuple2 [RDD Programming Guide (2018)] object itemTid. Finally, the action collect() returns the complete contents of the RDD freqItemTids to the driver program, where it is sorted and placed in a list in ascending order of item support. The lineage graph of the RDDs in Phase 1 of EclatV1 is depicted in Figure 3.
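The following minimal Scala sketch illustrates Phase 1 as described above, on a toy transaction database (the toy data and the parsing into (item, tid) pairs are our assumptions, not part of the original implementation):

  import org.apache.spark.SparkContext

  object EclatV1Phase1 {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local[*]", "EclatV1Phase1")
      val minSup = 2 // hypothetical minimum support count

      // Toy transaction database as (tid, items) pairs; a real run would read it from a file.
      val transactions = sc.parallelize(Seq(
        (1L, Seq("a", "b", "c")),
        (2L, Seq("a", "c")),
        (3L, Seq("b", "d")),
        (4L, Seq("a", "b", "c"))
      ))

      // Emit one (item, tid) pair for every item occurrence.
      val itemTid = transactions.flatMap { case (tid, items) => items.map(item => (item, tid)) }

      // groupByKey() gathers all tids of an item into its tidset.
      val itemTidsets = itemTid.groupByKey().mapValues(_.toSet)

      // The support count of an item is the size of its tidset;
      // filter() keeps only items with support count >= minSup (freqItemTids in the text).
      val freqItemTids = itemTidsets.filter { case (_, tidset) => tidset.size >= minSup }

      // (item, count) pairs, corresponding to the paired RDD freqItemCounts.
      val freqItemCounts = freqItemTids.mapValues(_.size)
      freqItemCounts.collect().foreach(println)

      // collect() brings the frequent items to the driver, where they are
      // sorted in ascending order of support.
      val sorted = freqItemTids.collect().sortBy { case (_, tidset) => tidset.size }
      sorted.foreach { case (item, tidset) => println(s"$item -> $tidset") }

      sc.stop()
    }
  }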

EclatV4 and EclatV5
Algorithms EclatV4 and EclatV5 split the equivalence classes into p partitions, where p is a user-supplied value, instead of the one-partition-per-class scheme of EclatV3. Heuristics are employed to balance the partitions of equivalence classes. These two algorithms differ from EclatV3 in the fourth phase, while the first three phases are identical.

Equivalence Class Partitioners
In general, an efficient partitioning of RDDs reduces data shuffling across the network in the subsequent transformations.
This section describes the partitioning techniques used in the proposed algorithms to partition the RDDs of equivalence classes (ECs) for the parallel and independent computation of frequent itemsets. Partitioning is done on the basis of the prefixes of the equivalence classes. Spark facilitates the implementation of custom partitioners; our custom partitioners are based on HashPartitioner, the default partitioner of Spark. A custom partitioner extends Spark's Partitioner class and implements its getPartition() method, in which the partitioning heuristic is defined. Algorithm 4.9 gives the pseudocode of the getPartition(v) method of the three custom partitioners, where v is the unique value assigned to the 1-length prefix of an equivalence class. A hash map is used to map each frequent item to a unique integer between 0 and n-1, where n is the number of frequent items. EclatV1, EclatV2, and EclatV3 create a separate partition for each equivalence class, i.e., (n-1) partitions; this is termed default partitioning. EclatV4 and EclatV5 impose a restriction on the number of partitions and construct p partitions, with the user specifying the value of p. The number of partitions determines the number of parallel tasks.
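The heuristics themselves are given in Algorithm 4.9; purely as an illustration of the shape of such a partitioner, the following Scala sketch uses a simple modulo heuristic (our assumption; the actual balanced-workload heuristics of EclatV4 and EclatV5 differ):

  import org.apache.spark.Partitioner

  // A custom partitioner extends Spark's Partitioner class and implements
  // numPartitions and getPartition(). The key v is the unique integer in
  // 0 .. n-1 assigned to the 1-length prefix of an equivalence class.
  class ModuloECPartitioner(p: Int) extends Partitioner {
    override def numPartitions: Int = p

    override def getPartition(key: Any): Int = {
      val v = key.asInstanceOf[Int]
      // Placeholder heuristic: spread prefixes round-robin over the p partitions.
      // The balanced-workload heuristics of Algorithm 4.9 would replace this line.
      v % p
    }
  }

An RDD of equivalence classes keyed by the prefix value v can then be partitioned with ecRDD.partitionBy(new ModuloECPartitioner(p)), so that each of the p partitions is mined in parallel.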

RESULTS AND DISCUSSION
This section presents the experimental environment, the datasets used, and the performance of the algorithms in terms of execution time. Three categories of performance analysis are discussed.

Execution Time on Varying Value of Minimum Support
Execution times of the proposed RDD-Eclat algorithms are compared with each other as well as with the Apriori algorithm. Considering EclatV1 against Apriori, EclatV1 is at least nine times faster than Apriori on the datasets BMS2 and T40I10D100K (Figures 6.12(a) and 6.14(a)), seven times on the dataset chess (Figure 6.9(a)), six times on the dataset BMS1 (Figure 6.11(a)), four times on the dataset c20d10k, and two times on the datasets mushroom and T10I4D100K (Figures 6.10(a) and 6.13(a)), at the lowest value of minimum support in each case. In short, RDD-Eclat outperforms Apriori by at least two times, and by up to nine times on some datasets.

Execution Time on Increasing Number of Executor Cores
The behavior of the proposed algorithms is investigated on the different datasets at their respective lower values of minimum support, for an increasing number of executor cores, as shown in Figure 4.15(a-e). Execution times were measured on the five datasets using 2, 4, 6, 8, and 10 executor cores. The execution time of the algorithms decreases as the number of cores increases. The reduction is more pronounced for algorithms that take longer than for those that finish quickly (Figure 4.15(a-e)). This shows that execution time can be decreased or maintained by assigning additional cores or nodes.

CONCLUSION
This article has addressed the challenge of developing scalable and efficient frequent itemset mining algorithms for big data analytics. The specific, as-yet-unexplored problem at hand was re-designing the Eclat algorithm for the Spark distributed computing environment. We devised and implemented five variants of the Eclat algorithm using the Spark RDD framework, dubbed RDD-Eclat. EclatV1 is the earliest version, followed by EclatV2, EclatV3, EclatV4, and EclatV5. Each version results from a different technique and heuristic applied to the preceding variant. Following EclatV1, the filtered transaction technique is applied, followed by heuristics for equivalence class partitioning in EclatV4 and EclatV5. EclatV2 and EclatV3 differ only slightly algorithmically, as do EclatV4 and EclatV5. Experiments on synthetic and real-world datasets show that the proposed algorithms outperform Spark-based Apriori and scale well with an increasing number of executor cores.
Ethical Standards Statements
I. Ethical Approval:
Ethical approval was not required for this study.

II. Funding Details:
There are no funding details available.

III. Conflict of Interest:
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper. The research work was carried out using the R&D Lab setup of the Department of Computer Science and Engineering, Osmania University, Hyderabad, Telangana, India.

IV. Informed Consent:
Informed consent was not applicable to this study.