Knowledgebase approximation using association rule aggregation

This paper introduces knowledgebase approximation and fusion using association rule aggregation as a means to accelerate insight induction from high-dimensional and disparate knowledgebases. Two typical observations make approximating knowledgebases of interest: (1) insights can quite often be derived from a partial set of the samples, and not necessarily from all of them; and (2) generally speaking, it is rare that the knowledge of interest is contained in one knowledgebase; rather, it is distributed among a disparate set of non-identical knowledgebases. In fact, the insights derivable from knowledgebases tend to be uncertain, even when derived from a holistic analysis of the knowledgebase. Thus, optimal knowledgebase approximation may yield computational efficiency benefits without necessarily compromising insight accuracy. This paper presents a novel method to approximate a set of knowledgebases based on association rule aggregation using the disjunctive pooling rule. We show that this method can reduce insight discovery time while maintaining approximation accuracy within a desirable level.


Introduction
Analyzing data becomes a challenging and expensive task when data grow in volume, variety, and velocity [1]. The accuracy and speed of many common predictive techniques degrade on high-dimensional data. Abundant, high-dimensional data can constitute a more valuable resource, but exploiting it entails incorporating more sophisticated predictive analysis [2]. Moreover, the absence of accurate and well-organized data, or the inability to process large datasets, may result in false and spurious insights. Several projection pursuit and manifold methods, such as principal component analysis (PCA) and multidimensional scaling (MDS), are used for dimensionality reduction on high-dimensional data. However, such methods typically either rely on restrictive assumptions, such as requiring variables to be highly correlated, or accept only numeric values.
Typically, the underlying knowledge of a dataset is more important than the dataset itself in designing information systems [3]. The knowledge extracted from a dataset is stored in knowledgebases which contain information at a higher level of abstraction. Knowledgebases store general facts and rules which might be deduced from thousands of data samples. Therefore, the memory requirements for a knowledgebase are much lower compared to a conventional database.
Creating knowledgebases from datasets of reasonable size is simple, but the complexity of knowledgebase generation grows exponentially with the size of the feature space, especially when typical dimensionality reduction methods are not applicable [4]. In the presence of high-dimensional data, it may not be feasible to apply complicated functions directly to the dataset. As a result, people often refrain from using some fundamental features to avoid increased computational complexity, even though those features could enrich the induced insights dramatically.
In the majority of cases, all the features needed to create a complete and comprehensive knowledgebase cannot be found in a single dataset [5]. While data integration can be used to unify disparate datasets, it does not necessarily construct a reliable dataset containing all the features in one place. Even if it does, the resulting dataset would need a more intensive processing effort to output the desired knowledgebase. As a result, smaller datasets are processed and more partial knowledge is produced every day. Generally speaking, it is rare that the knowledge of interest is contained in one knowledgebase; rather, it is distributed among a disparate set of non-identical knowledgebases. As such, computational efficiency and knowledge fusion are major design concerns in insight induction systems.
This paper presents a novel approach, inspired by the disjunctive rule of combination in Dempster-Shafer theory, that aims to approximate knowledgebases. The knowledge here is represented in the form of rules extracted using association rule mining (ARM) [6] techniques. To demonstrate the capacities of knowledgebase approximation, we apply this method to well-known classification problems and show that it successfully generates rules that approximate correlations in the input dataset. This behavior is beneficial for knowledge fusion from multiple datasets and enhances computational efficiency when dealing with high-dimensional data.

Related work
An important problem in knowledgebase integration or knowledgebase approximation is how to reach a coherent piece of information when faced with conflicting information coming from several sources. Some methods use combination operators and are based on the union of the knowledgebases and on the selection of some maximal subsets [7,8]. These operators do not take the individual side of the merging into account, as the source of information does not matter in the combination process. Hence, some are motivated to use a selection function to choose among the subsets that best fit a merging criterion [9,10]. One option here is the drastic majority operator, in which the distance between two knowledgebases is drastic, i.e., set to 0 if the conjunction of the two is consistent and to 1 otherwise. This operator is evidently very coarse, so a more subtle alternative is to use cardinality operators [11,12].
In the past few years, with the explosion of data in various fields of research, the necessity of knowledgebase approximation and integration has been felt more than ever. For semantic analysis of social media feeds, for example, integration of local knowledgebases has been explored and shown to improve sentiment analysis and event detection in social media streams [13]. The idea has also emerged in learning knowledgebase representations, where feature types are integrated through a combination of neural representation learning and probabilistic product-of-experts models [14].
Knowledgebase integration is also a popular topic of investigation for web search engines as the required information of users could be distributed in the databases of various search engines [15].
Some researchers have applied approximation to probability models. Chow et al. [16], for example, used approximation for discrete probability distributions with dependence trees, and Kappen et al. [17] used second-order approximations for probability models. For approximating knowledgebases, on the other hand, the idea has been developed in the form of knowledge compilation or approximate knowledge fusion. Selman et al. [18], for example, used Horn approximations for knowledge compilation, Martires et al. [19] investigated knowledge compilation with continuous random variables and its application in hybrid probabilistic logic programming, and Dangdang et al. [20] explored knowledge compilation methods based on clausal relevance and the extension rule.
Despite the advantages that each of these methods offers, the initial problem of reaching a coherent piece of information from contradictory sources without compromising computational complexity remains in place. While handling the conflicts depends on the content of the datasets, the computational complexity of knowledgebase integration methods is understudied and can be improved substantially. Our observation from the review of the literature is that adopting fusion techniques can maintain the breadth of integrated knowledge while reducing computational complexity. Dunin-Ke et al. [21], for example, used the Horn fragment of serial propositional dynamic logic to perform tractable approximate knowledge fusion and showed that the obtained formalism is quite powerful in applications. In this study, we incorporate an information fusion technique along with association rule mining to benefit from both improved conflict handling and reduced computational complexity.

Methodology
The knowledgebase approximation method proposed in this paper uses association rules to represent knowledge. Association rule mining techniques are the tools that extract association rules, and this study incorporates them for knowledge induction. Association rule mining is widely known as a tool for market basket analysis [22], but is applicable to a wide variety of datasets in different domains such as medical diagnosis [23], bioinformatics [24], web mining [25], and traffic accident analysis [26]. This section presents the methodology for our proposed knowledgebase approximation technique on the basis of combination of evidence in the form of association rules.

Knowledge induction
Association rule mining (ARM) [6] is a fundamental data mining technique with the ability to uncover hidden relationships in a relational dataset, even when the data seem unrelated. The set of association rules generated by ARM reveals insights describing the underlying relations in a dataset, even if the relations are implicit [27]. An association rule is denoted by A → B, with A and B being sets of attributes known, respectively, as the antecedent and the consequent of the rule. A rule A → B is an if/then statement indicating that if A happens then B also happens. A and B are sometimes referred to as itemsets since they can each contain a set of disjoint items.
The association of a dataset's attributes is mainly found by determining how frequently they appear together in the dataset. Several algorithms have been proposed to mine frequent itemsets for this purpose. The most commonly used algorithms are Apriori, Equivalence Class Clustering and bottom-up Lattice Traversal (Eclat), and Frequent Pattern (FP) Growth [28][29][30]. The frequent itemsets are then converted into association rules by identifying the antecedent and consequent sets.
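As an illustration of the frequent-itemset mining step, the bottom-up search that Apriori performs can be sketched in a few lines of Python. This is a minimal, illustrative implementation under our own naming, not the optimized algorithm found in ARM libraries:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Bottom-up Apriori sketch: grow candidate itemsets one item at a
    time, keeping only those whose support meets the threshold."""
    n = len(transactions)
    tx = [set(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in tx if itemset <= t) / n

    # Level 1: frequent single items
    items = {i for t in tx for i in t}
    frequent = {frozenset([i]): support(frozenset([i]))
                for i in items if support(frozenset([i])) >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: build size-k candidates from frequent (k-1)-itemsets
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c: support(c) for c in candidates
                    if support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The breadth-first, level-wise structure is what the discussion in Sect. 5 refers to: each pass over the data only tests candidates whose subsets already survived the previous pass.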
The three main measures of significance, based on which ARM applies constraints and selects the interesting rules, are support, confidence, and lift. Equations (1), (2) and (3), respectively, calculate the support, confidence, and lift of a rule A → B in a dataset. The support of a rule A → B indicates the percentage of all itemsets in which A and B occur together. The confidence of that rule indicates the percentage of itemsets in which, when A occurs, B also occurs. Therefore, the former is used to find the frequent itemsets and the latter to filter the strong rules. Lift is also used in ARM algorithms to measure the correlation between A and B in a dataset. A higher lift value (greater than 1) indicates that A and B appear together more frequently than expected under independence, whereas a lower value (less than 1) indicates the reverse. A lift value of 1 implies that A and B are independent events and no association rule should involve both.
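Assuming the standard definitions of these measures (support as joint frequency, confidence as conditional frequency, lift as their ratio to the consequent's frequency), the computation can be sketched as follows; function and variable names are illustrative:

```python
def rule_measures(transactions, antecedent, consequent):
    """Standard support, confidence and lift for a rule A -> B
    over a list of transactions (each an iterable of items)."""
    n = len(transactions)
    tx = [set(t) for t in transactions]
    a, b = set(antecedent), set(consequent)
    freq = lambda s: sum(1 for t in tx if s <= t) / n
    support = freq(a | b)             # fraction containing A and B together
    confidence = support / freq(a)    # fraction of A-transactions with B
    lift = confidence / freq(b)       # confidence relative to P(B)
    return support, confidence, lift
```

A lift of exactly 1 falls out of the last line when A and B co-occur no more often than independence predicts.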
Interesting rules are typically identified based on a minimum support threshold and a minimum confidence threshold. If a rule satisfies both thresholds, it is considered interesting. The thresholds are set empirically so that the number of useful rules is maximized and the number of useless ones minimized.

Combination of evidence
This paper proposes association rule aggregation to represent an approximated knowledgebase. We define association rule aggregation as combining the antecedents of multiple association rules, mined from different datasets, that share the same consequent. A simple example is illustrated in Eq. (4):

DB1: A ∩ B → P,   DB2: B ∩ C → P   ⟹   DB12: A ∩ B ∩ C → P,   (4)

where DB1 and DB2 are two different datasets and B is an attribute found in both datasets. DB12 is a dataset that consists of the attributes of both DB1 and DB2. In association rule integration, we are interested in knowing whether A ∩ B ∩ C → P is a valid expression when DB1 and DB2 are integrated. This problem can be regarded as an extension of combining theories using intersection (see [31]) in which only "P" formulae appear in the rules' heads.
We chose combination of evidence to tackle this task, which is mainly known as an information fusion technique in the context of Dempster-Shafer (DS) theory [32]. DS theory, often described as an extension of probability theory or a generalization of the Bayesian inference method [33], offers an alternative for the mathematical representation of epistemic uncertainty. As opposed to traditional probability theory, where evidence is associated with single events, DS theory deals with evidence associated with sets of events and probability values assigned to sets of possibilities. DS theory works at higher levels of abstraction by adding a third aspect, called unknown, to crisp logic. The basic idea is built upon obtaining degrees of belief from subjective probabilities and combining them using independent items of evidence [34].
The three main functions used in DS theory are the basic probability assignment function (BPA or m), the belief function (Bel), and the plausibility function (Pl). In this paper, we are interested in generating rules' strengths using the BPA function; the application of Bel and Pl is not desirable here, for reasons discussed in Sect. 3.3. The BPA function assigns masses to all subsets of the entities in a system by mapping the contents of the power set (P_Ω) to the interval [0, 1]. The mass of a subset p_i is commonly denoted by m(p_i) and represents the amount of knowledge associated with that subset. In other words, m(p_i) expresses the proportion of all available evidence that supports p_i but no particular subset of it. Each element p_i ∈ P_Ω is called a focal element of P_Ω if m(p_i) > 0, and the set of all focal elements is named a body of evidence (BOE). The following three equations represent the above description of m:

m: P_Ω → [0, 1],   (5)
m(∅) = 0,   (6)
Σ_{p_i ∈ P_Ω} m(p_i) = 1.   (7)

When multiple independent BOEs are available, which assumes the existence of independent generic sources of information, we can use Dempster's rule of combination (DRC) to compute the aggregated BPA on p_i [35]. Having two independent events p_a and p_b with their BPAs expressed by m_1(p_a) and m_2(p_b), DRC can be applied as follows:

m_12(p_i) = (1 / (1 − K)) Σ_{p_a ∩ p_b = p_i} m_1(p_a) m_2(p_b),   (8)

where K = Σ_{p_a ∩ p_b = ∅} m_1(p_a) m_2(p_b) is a normalization constant called the conflict degree and represents the amount of conflicting evidence between the two sources of information. DRC is a purely conjunctive operation: it is AND-based and operates on set intersection. In situations where not every source is reliable but at least one reliable source exists, a modified DRC, known as the disjunctive pooling rule (DPR) [36], is more appropriate. As opposed to DRC, DPR is OR-based and operates on set union. DPR does not reject any of the information asserted by the sources and does not generate any conflict. It can be applied to two independent events p_a and p_b using Eq. (9):

m_12(p_i) = Σ_{p_a ∪ p_b = p_i} m_1(p_a) m_2(p_b).   (9)

DPR is more robust than DRC in the presence of conflicting evidence, and its use is appropriate when the conflict is due to poor reliability of some of the sources. In other words, DRC works on the assumption that the belief functions to be combined are induced by reliable sources of information, whereas DPR only assumes that at least one source of information is reliable, without knowing which one. Both rules assume the sources of information to be independent. DPR is defined based on the union of the basic probability assignments (BPAs) by extending the set-theoretic union and hence is an appropriate operator for insight aggregation [37]. Some other characteristics of DPR that recommend it for this purpose are as follows:
- Unlike conjunctive pooling, disjunctive pooling incorporates all the information asserted by the sources rather than selecting only the part that is in consensus.
- The union does not generate any conflict.
- No normalization procedure is required.
- DPR is commutative and associative, but not idempotent.
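To make the contrast between the two combination rules concrete, the following sketch combines two BPAs represented as Python dicts mapping frozensets to masses. It is an illustrative implementation of DRC and DPR, not code from this study:

```python
from itertools import product

def combine(m1, m2, mode="dpr"):
    """Combine two BPAs (dicts: frozenset -> mass).
    mode='drc': Dempster's conjunctive rule (intersection, normalized
                by 1 - K where K is the conflict degree);
    mode='dpr': disjunctive pooling rule (union, no normalization)."""
    fused = {}
    conflict = 0.0
    for (pa, ma), (pb, mb) in product(m1.items(), m2.items()):
        p = pa & pb if mode == "drc" else pa | pb
        if mode == "drc" and not p:
            conflict += ma * mb      # mass assigned to the empty set
            continue
        fused[p] = fused.get(p, 0.0) + ma * mb
    if mode == "drc":
        k = 1.0 - conflict           # normalization constant (1 - K)
        fused = {p: m / k for p, m in fused.items()}
    return fused
```

Note that in DPR mode no mass is ever discarded: every product of masses lands on the union of the two focal elements, which is exactly the "no conflict, no normalization" behavior listed above.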

Knowledgebase approximation framework
Our approach to knowledgebase approximation focuses on integrating knowledge, drawn in the form of if/then rules using the ARM method, from smaller datasets with fewer features. These smaller datasets may be obtained from different data providers with their own objectives. In this case, approximating the knowledgebase corresponding to the integrated dataset saves the effort of integrating the datasets and processing the bigger emerging dataset. Nonetheless, any large dataset can be broken into smaller ones by selecting only certain features to appear in each of them. As a result, we only need to deal with multiple lower-dimensional datasets requiring lower computational effort.
Let us assume that N independent datasets are available for investigation, indicated by DB_i with i ∈ {1, ..., N}. These datasets are the main sources of information from which we aspire to obtain an approximated knowledgebase comprising all the attributes that appear in any of the datasets. Any pair of datasets may share common features; hence, the dimension of the corresponding integrated dataset is not necessarily equal to the sum of the numbers of features. At the end of this section, we will show that common features help the DPR method find a good approximation.
In order to induce knowledge from the smaller datasets, ARM is applied to each dataset, generating N independent rulesets. ARM explores and connects the attributes that contribute to the occurrence of a particular event or set of events. Depending on the size and nature of the datasets, different ARM methods can be used. Given a minimum support threshold and a minimum confidence threshold, ARM finds all the strong association rules, that is, those whose confidence and support values are equal to or greater than the thresholds. A rule that does not meet the thresholds is called a weak association rule.
Having mined all the association rules from the available datasets, there will be N independent rulesets available which are denoted by RS i and i ∈ {1, 2, . . . , N }. If the antecedents of the rules with the same consequent in different rulesets satisfy the requirements of the DS theory to be regarded as pieces of evidence, they can be assigned mass values and combined as instructed in the DS theory. The rules are mined from independent datasets and hence satisfy the independence of information sources. The mass values of the rules' antecedents can be simply a weighted average of the ARM interestingness measures including support, confidence, and lift. Since the summation of mass values for a specific event should be equal to 1, these weighted averages are normalized over all the rules that point to a particular consequent. More sophisticated BPA mappings can be defined depending on the degree to which domain knowledge is available.
Since ARM restricts the rules to those satisfying the minimum support and confidence thresholds, the masses corresponding to the emerging rules can be assumed to have nonzero BPAs and hence be regarded as focal elements. Consequently, by mapping the interestingness of the rules to mass values, the rulesets can be transformed into independent bodies of evidence (BOEs) to which combination rules can be applied. In other words, the mass value assigned to a focal element is proportional to the strength, or interestingness, of its generating rule. This paper incorporates DPR to combine the independent BOEs obtained from the lower-dimensional datasets. DPR is a union-based operator: unlike DRC, which selects a condensed part of the evidence, it selects an extended piece of evidence based on the number and weights of the BOEs that can shape that extended piece in aggregate. In our case, pieces of evidence represent association rules, and extending them generates a rule with a larger number of antecedents. To merge the antecedents of multiple rules, their consequents should be the same, as shown in Fig. 1. Therefore, the rules are filtered into groups in advance based on their consequents, and the rule-to-BOE transformation and application of DPR are performed for each group separately.
As illustrated in Eq. (9), DPR uses the BPA values defined over different domains to find a fused set of masses assigned to the higher-dimensional domain. In a simplified version of the problem, where the BPA values are disregarded, the strength of the association rules is dismissed and all rules have the same impact on the generation of the elements in the fused set. Figure 2 elaborates on how DPR differentiates between strong and weak extended rules when BPA values are not considered. In this figure, DB_q − R_k represents rule number k induced from DB_q, and DB_1 − R_i1 is the first rule found in DB_1 that can generate a specified extended rule in aggregate with another rule, DB_2 − R_i2, found in DB_2. In this figure, a strong rule is an extended rule that is reproduced many times by the union of different pairs of rules, one from DB_1 and one from DB_2. In contrast, a weak rule is reproduced relatively infrequently, meaning that few rules from DB_1 and DB_2 could be augmented in their antecedents to form that extended association rule.
In the real scenario, the BPA values are not disregarded, and each association rule in the rulesets is assigned a mass corresponding to the rule's strength. What changes in this case is that the strength of the extended association rules is no longer measured only by the count of reproductions. Instead, the products of the masses of all pairs of rules whose antecedents' union reproduces the specified extended rule are accumulated. This procedure can also be applied to more than two datasets, where the union is conducted over a combination of rules from the disparate datasets. Figure 3 outlines the proposed method, in which N independent datasets are assumed available for investigation. Mining transaction datasets for association rules typically generates a large number of rules. When ARM is used for subsequent prediction, most of the rules become unnecessary and can be eliminated. In our method, however, we utilize every generated rule that satisfies the minimum support and confidence thresholds. Using the informative rulesets along with their dependent rules helps the DPR method better sort the extended association rules by strength. Once the fused rules and their corresponding BPAs are created, we can keep the informative ruleset and eliminate the dependent rules.
Rules' masses in our method are obtained by multiplying their support and confidence and normalizing over the whole ruleset. Let us assume q rules are mined from a ruleset. If rule r_i has confidence conf(r_i) and support sup(r_i), then its mass is calculated using Eq. (10):

m(r_i) = sup(r_i) · conf(r_i) / Σ_{j=1}^{q} sup(r_j) · conf(r_j).   (10)
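The mass assignment just described, the product of support and confidence normalized over the ruleset, can be sketched as follows (the rule identifiers and input format are illustrative):

```python
def rule_masses(rules):
    """Assign each rule a mass proportional to support * confidence,
    normalized so the masses over the ruleset sum to 1.
    `rules` maps a rule identifier to a (support, confidence) pair."""
    weights = {r: sup * conf for r, (sup, conf) in rules.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}
```

After this step each ruleset behaves as a BOE: the masses are nonnegative and sum to 1 over the rules pointing at a given consequent.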
When all the rules in the N rulesets are transformed into BOEs, DPR can be applied to integrate them into a single set of probability mass assignments, indicated as the fused evidence in Fig. 3. This set combines masses attributed to both strong and weak rules, but we can easily prune the masses and keep the stronger ones. To do so, we consider the BPA in the fused set as a measure proportional to the product of the confidence and support of the rule that generated it. We refer to this measure as rule strength.
In DS theory, the probability of a proposition falls between the belief and plausibility values. In our case, however, we deal with rules' strengths as opposed to probabilities. These strength values are deterministic and proportional to the fused BPAs and, unlike probabilities in DS theory, do not fall between the belief and plausibility values. Hence, our method is inspired by DS theory rather than being based on it. We select those BPAs in the fused set that can generate rules with a strength above a minimum threshold and regard them as dominant BPAs. We then use the dominant BPAs to generate the integrated insights. The minimum strength threshold (MST) for this purpose is calculated from the minimum confidence and support of the rules in the original rulesets, as indicated in Eq. (11).
If the BPA of a combined rule is less than the MST threshold, it is considered a weak rule and hence will be pruned.
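The pruning step can be sketched as follows. Since Eq. (11) is not reproduced here, the MST form used below, the product of the minimum support and minimum confidence thresholds, is an illustrative assumption rather than the paper's exact formula:

```python
def prune_fused_rules(fused, min_support, min_confidence):
    """Keep only dominant BPAs: fused rules whose strength reaches the
    minimum strength threshold (MST). The MST form used here (product
    of the ARM thresholds) is an assumed stand-in for Eq. (11)."""
    mst = min_support * min_confidence
    return {rule: m for rule, m in fused.items() if m >= mst}
```

With the thresholds used later in the paper (support 25%, confidence 40%), this assumed MST would be 0.1; fused rules below it would be discarded as weak.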

Application to pattern recognition
Associative classification [38] is an integration of association rule mining and classification that has been widely investigated in pattern recognition. Previous studies show that associative classification can achieve high classification accuracy and is highly flexible at handling unstructured data [39]. Among the algorithms proposed for classification based on multiple-class association rules, CMAR and CPAR show competitive performance in the experimental results of [40,41]. CPAR generates its own set of predictive rules directly from the dataset, while CMAR selects a small set of high-confidence rules from previously mined rules. Hence, in this paper, we apply CMAR to the association ruleset in an approximated knowledgebase obtained by our association rule aggregation method and show that the classification accuracy can be maintained when a certain number of attributes are shared between two datasets. In other words, a knowledgebase can be approximated by applying our method to its corresponding lower-dimensional datasets when there is enough information shared among them.
We have used the lymphography dataset [42], along with six other datasets obtained from the University of California, Irvine (UCI) machine learning repository, to evaluate our knowledgebase approximation framework. We use the lymphography dataset as an example to show how this approximation method is applied to a dataset, but we use all seven datasets to show the effectiveness of our approach in terms of approximation accuracy and run time.
The lymphography dataset was recorded at the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. It contains 148 instances and 19 numeric-valued attributes covering different aspects of lymphographic clinical data, including the class attribute. Table 1 describes these attributes and the values used for them in the dataset. The class variable takes one of four values: normal, metastases, malign lymph, and fibrosis.
As described in the previous section, this paper adopts ARM for extracting knowledge in the form of association rules. One of the criteria in ARM for selecting interesting rules is support, which targets rules whose components appear adequately often in the dataset. Among the existing classes in the lymphography dataset, the normal and fibrosis classes contain very few samples and cannot satisfy the support measure. Therefore, our investigation is limited to the metastases and malignant classes.
In order to approximate the knowledgebase for this dataset, we divided it into two smaller datasets in the feature space. This division was performed nine times, each time increasing the percentage of common features. Once the original dataset is divided, ARM is applied to each smaller dataset to extract rulesets, as illustrated in Fig. 5. Before proceeding with integration, the rules were filtered to separate those with consequents of metastases and malignant, as shown in Fig. 1. The smaller rulesets with matched consequents are then independently aggregated based on our proposed integration framework to create two sets of extended rules with metastases and malignant as the consequents. These two sets can then be blended into a single set, indicated as the fused rules, to represent the information in the approximated knowledgebase.
As shown in Fig. 5, the ruleset from the original dataset was also induced as the ground truth for evaluating the proposed approximation method in this case study. As mentioned at the beginning of this section, the accuracy of associative classification is used to compare the knowledgebase and its approximation. The CMAR classification method is applied to the fused ruleset and to the ruleset induced directly from the original dataset to compare classification accuracy. For classification, 10-fold cross-validation was used, with 70% of the original dataset's samples for training; the remaining 30% was used as the test set. The 70:30 split ratio was selected empirically and produced the best classification accuracy among the data split ratios commonly used in machine learning. The same approach was applied to the other six UCI datasets and is reported in Sect. 5. Our knowledgebase approximation method can also be applied to more than two datasets. In this case, two approaches with the same outcome are available: one is to apply the fusion operation to all datasets at once, and the other is to integrate the rulesets in a cascade, i.e., the approximated ruleset resulting from each pair is fused with the next ruleset.

Results and discussion
In this study, we explored the Apriori, FP-growth, and Eclat algorithms, which are the best-known ARM methods and are all simple enough for extracting association rules from a cohort dataset. However, other algorithms could be incorporated to generate insights for the proposed insight aggregation method. Among the three explored algorithms, we found the Apriori algorithm faster and more accurate (Figs. 6 and 7). What makes Apriori a good candidate is its bottom-up approach, i.e., one item is added to the frequent itemsets at a time and tested against the data. Also, the breadth-first search nature of this algorithm makes it suitable for finding the desired rules without considerable computational complexity on the small dataset in use here. Furthermore, it is easier to store the dependent rules of those in the smallest informative ruleset.
Based on empirical experiments, we set the value of support to 25%. A reasonable value for support is essential for identifying the rules worth considering for further analysis. If an itemset happens to have a very low support, it will not provide enough information on the relationship between its items, and hence no concrete conclusion can be drawn from such a rule.
The lymphography dataset has 148 instances, so a support of 25% guarantees that the selected itemsets appear together in at least 37 instances. Although we can find many rules with a confidence value of 100% if we decrease the support threshold, those rules are not necessarily useful, since their itemsets do not occur often enough. To illustrate, suppose items A, B, and C occur together in only two instances; a rule with 100% confidence can easily be derived from them. In contrast, if the number of instances is high (say 10,000 or more), it makes sense to lower the support threshold, because even 1% still corresponds to many instances (100 or more in the 10,000-instance example).
As the goal of lymph classification was to relate the predictor variables to the occurrence of the two classes of metastasis and malignant, all predictor variables were limited to appear only in the antecedent (IF part), and the lymph classes (the outcome variable) only in the consequent (THEN part). To generate all the strong association rules, we conducted our analysis by selecting any rule satisfying an initial support threshold of 25% and a confidence threshold of 40% for frequent itemset generation and rule induction. For each of the two smaller datasets, two types of rules were extracted, for metastasis and malignant separately. We used Orange3 in Python 3.4 to apply the association rule algorithm.
After approximating the original dataset's knowledgebase by integrating the association rules from the smaller datasets, CMAR associative classification was applied to the aggregated rules. We applied CMAR across the variations in common features and used 10-fold cross-validation, as the original data samples were limited. The accuracies are reported in Table 2. The accuracy of CMAR in classifying the original dataset is 83.1%, which is used to normalize the approximation accuracy. In other words, an approximation accuracy of 100% indicates that the maximum possible classification accuracy, i.e., 83.1%, is obtained.
As discussed in the previous sections, the processing time for finding association rules, or more generally for inducing insights, grows exponentially with the size of the feature space. Our approximation method decreases this processing time significantly, while the approximated knowledgebase remains highly accurate when there are enough features in common. The average processing time over the nine experiments reported in Table 2 is 0.16 s for our method, whereas extracting association rules from the original dataset takes about 1.1 s, i.e., our method runs roughly seven times faster. It is worth noting that the lymph dataset is relatively small, containing only 148 samples, 18 features, and 4 classes; for larger datasets, the savings can be expected to be considerably greater.
Tables 3 and 4 report the number of correctly and incorrectly recovered rules in each experiment for the metastasis and malignant consequents, separately. The basis for distinguishing correct from incorrect rules is the smallest informative ruleset induced from the original dataset; this basis is chosen because our goal is to approximate the knowledgebase of the original dataset. In these tables, key features are those most relevant to the consequents, i.e., those that best distinguish the outcome. In the lymph dataset, seven such key features can be used to differentiate between a metastasis and a malignant lymph. Table 5 summarizes the results for all seven UCI datasets when our approximation method is applied. The run time and accuracy of the approximation method are averaged over the nine experiments in which the number of common features is increased from 10 to 100% in steps of 10%. As can be seen from the table, the averaged approximation accuracy drops as the CMAR accuracy increases. The reason is that the CMAR accuracy is used to normalize the approximation accuracy; hence, the actual classification accuracy on the approximated dataset remains in the acceptable range of 70-90%. Also noteworthy is the run time of classification on the approximated knowledgebase, which is significantly lower than the CMAR run time on the original dataset.
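Counting correctly and incorrectly recovered rules against the reference ruleset amounts to simple set comparisons; the concrete rules below are illustrative only:

```python
# A sketch of scoring recovered rules against a reference ruleset (the
# smallest informative ruleset induced from the original dataset).
# Rules are hashable (antecedent, consequent) pairs; values are made up.
def score_recovered(recovered, reference):
    correct = recovered & reference    # rules also present in the reference
    incorrect = recovered - reference  # rules absent from the reference
    return len(correct), len(incorrect)

reference = {(frozenset({"A", "B"}), "metastasis"),
             (frozenset({"C"}), "malignant")}
recovered = {(frozenset({"A", "B"}), "metastasis"),
             (frozenset({"D"}), "malignant")}
print(score_recovered(recovered, reference))  # (1, 1)
```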
We compared the approximation accuracy and run time of the approximation method using six different combinations of rule generation and rule aggregation. All combinations follow the same procedures described in this paper, but differ in either using DRC instead of DPR to combine the BPAs or in using a different ARM method to generate the frequent itemsets. Figures 6 and 7 illustrate the accuracies and run times of these methods. As explained in Sect. 3.2, DRC is a conjunctive, AND-based operation that works on set intersection. DRC relies on the consensus of the evidence and hence reports as the antecedents of the fused rules only the items on which many of the rules in the initial datasets agree. This causes the fused rules to shrink, which is not ideal for knowledge integration and lowers the accuracy of the knowledge approximation, as can be seen in Fig. 6. It also negatively affects the run time, as rule conflicts must be computed every time a fused BPA is generated. Some of the combinations in Figs. 6 and 7 use FP-growth or Eclat to generate their association rules. FP-growth finds frequent patterns without candidate generation: it constructs an FP-tree rather than using the generate-and-test strategy of Apriori. Eclat, on the other hand, uses intersections of transaction-id sets (tidsets) to compute the support of a candidate, preventing the generation of subsets that do not exist in the prefix tree. According to Fig. 6, the DPR-Apriori method outperforms both the FP-growth and Eclat methods regardless of which combination rule they are paired with. The Eclat method, however, competes with Apriori in run time (Fig. 7), since it is an efficient ARM method that works in a vertical manner similar to a depth-first search over a graph.
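The contrast between the disjunctive (DPR) and conjunctive (DRC) combination of two BPAs can be sketched with the standard Dempster-Shafer formulas; the toy masses below are illustrative, not drawn from the experiments:

```python
# A minimal sketch of DPR (union-based) vs. DRC (intersection-based)
# combination of two BPAs, each a dict mapping a frozenset of items to
# its mass. The example BPAs are toy values, not the paper's data.
from itertools import product

def combine(m1, m2, op):
    """Combine two BPAs by applying a set operation to focal elements."""
    out = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        c = op(a, b)
        out[c] = out.get(c, 0.0) + wa * wb
    return out

def dpr(m1, m2):
    # Disjunctive pooling: focal elements are unions, so fused rules
    # grow and no conflict mass arises (nothing to normalize).
    return combine(m1, m2, frozenset.union)

def drc(m1, m2):
    # Dempster's rule: focal elements are intersections; mass landing on
    # the empty set is the conflict K and is removed by normalization.
    fused = combine(m1, m2, frozenset.intersection)
    k = fused.pop(frozenset(), 0.0)
    return {c: w / (1.0 - k) for c, w in fused.items()}

m1 = {frozenset({"A", "B"}): 0.7, frozenset({"C"}): 0.3}
m2 = {frozenset({"B"}): 0.6, frozenset({"C"}): 0.4}
print(dpr(m1, m2))  # focal elements grow (union of antecedents)
print(drc(m1, m2))  # focal elements shrink, conflict K redistributed
```

The extra step of computing and redistributing the conflict K in `drc` is the overhead the text attributes to DRC's run time.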
As previously noted in Sect. 4, the DPR-based knowledgebase approximation can be applied to more than two datasets, and there are two ways to apply the fusion operation with the same outcome. Breaking the dataset into more partitions can boost processing speed even further by enabling parallel computation on those partitions. We investigated the effect of the number of partitions on the run time and the classification accuracy using the approximated lymph dataset. First, we increased the number of partitions while all partitions were in use and monitored the run time and accuracy. Second, we fixed the number of partitions at six and used only a subset of them, ranging from two to six. The results of these two cases are shown in Figs. 8 and 9, respectively.
In the first case, the approximation accuracy starts to drop as the number of partitions increases. The reason is that the approximation takes place over smaller pieces of the dataset, so each ruleset carries less information. As a result, the in-common rules become smaller and smaller until only the key features drive the classification; this can be seen in Fig. 8b, where the slope of the connecting lines starts to decrease. The same happens to the run time: it keeps shrinking until no more processing units are available to extract the rules of new partitions in parallel with the previous ones. In contrast, in the second case the approximation accuracy and run time increase as the number of partitions in use increases. This is expected, since more information becomes available as more partitions are used. The effect of the number of partitions in this case is illustrated in Fig. 9a, b. It is worth noting that in this case the dataset is partitioned along the second dimension, i.e., the samples, as opposed to the previous case, where it was partitioned along the feature space.
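The two partitioning schemes compared above can be sketched as follows; the helper names are illustrative, and the actual rule extraction on each partition is left out:

```python
# A minimal sketch of the two partitioning schemes: splitting along the
# feature space (case 1) vs. along the samples (case 2). Helper names
# and the ceiling-division strategy are illustrative assumptions.
def partition_features(columns, k):
    """Split the feature space into k (near-)equal column groups."""
    size = -(-len(columns) // k)  # ceiling division
    return [columns[i:i + size] for i in range(0, len(columns), size)]

def partition_samples(rows, k):
    """Split the samples into k (near-)equal row groups."""
    size = -(-len(rows) // k)
    return [rows[i:i + size] for i in range(0, len(rows), size)]

cols = [f"f{i}" for i in range(18)]            # 18 features, as in lymph
print(partition_features(cols, 6))             # six feature partitions
print([len(p) for p in partition_samples(list(range(148)), 6)])
```

Each partition can then be mined independently, which is what makes the parallel speed-up described above possible.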

Conclusion
Knowledge discovery is becoming a central issue for industrial and government organizations. The ability of these organizations to conduct their business effectively and efficiently depends heavily on the insights they derive from knowledgebases relevant to their businesses. This has led to the emergence of insight induction systems as an important computing discipline. Big data knowledgebases tend to be disparate and high-dimensional, and in the presence of high-dimensional data it may not be feasible to apply complicated functions directly to the dataset. As such, computational efficiency and knowledge fusion are major design concerns in insight induction systems. This paper introduced a knowledgebase approximation methodology to address two challenges in data analysis: (1) the efficiency of association rule mining on huge datasets, and (2) the integration of rules induced from disparate datasets without requiring integration at the data level. We proposed using the DPR approach along with BPA assignment to combine the rulesets, and assigned a new measure of interestingness, i.e., rule strength, to the fused rules. The fused ruleset is an approximate knowledgebase for all the data available in the disparate datasets.
Our experiments on the lymphography dataset and six other datasets from the UCI machine learning repository show that our knowledgebase approximation method achieves high accuracy when the proportion of in-common features between the two smaller datasets is above 60%. DPR pursues the integration of knowledge, as opposed to the consensus of knowledge pursued by DRC, and therefore does not need to manage conflicts among the rules. The impact of conflict on the approximation accuracy can nevertheless be investigated in future work.
There are other union- and intersection-based integration methods that fit the model introduced in this paper and may perform better depending on the case study to which the model is applied. We leave the investigation of those models as future work. There are also other works on combined pattern mining [43][44][45], action rules [46,47], and actionable pattern mining [48] that may be relevant and could be explored as potential extensions of the current work.