Combinatorial Association Mining Method for High-Dimensional Data

To address the low mining efficiency, strong subjectivity and excessive number of association relations produced by classical mining algorithms, a novel association rule mining algorithm for high-dimensional data, Marc, is designed from the perspectives of both sample selection and association rule generation. The algorithm reduces the influence of weak samples at the start of mining by calculating the distribution coefficient and deletion threshold of the samples and combining them with the user-defined support to double-screen the samples during the first read of the dataset. When generating frequent items and association rules, the algorithm mines information from a sample relationship table and the full relationship combinations of the samples, which reduces the complexity and resource consumption of the mining process. The experimental results show that the number of frequent items and association rules mined by the Marc algorithm is significantly reduced, and its mining efficiency and memory consumption are better than those of the Apriori, FP-Growth and Eclat algorithms; the higher the dimension and the larger the dataset, the more obvious the advantage. The accuracy of the Marc algorithm in mining frequent items and association relations is 100%.


Introduction
Association rule mining 1 is one of the main tasks of data mining 2,3 and one of the most important and active research areas in the field 4 . It was first applied to customer behavior mining and has gradually been extended to various other fields 5,6 ; in particular, it has very important practical significance in the analysis of users' business behavior. The classic association rule mining algorithms mainly include the Apriori 7,8 , FP-Growth 9,10 and Eclat 11,12 algorithms. The Apriori algorithm requires multiple scans of the dataset and suffers from low mining efficiency and redundant candidate sets 13,14 . Although FP-Growth reduces the number of data scans and improves mining efficiency 15,16 , it generates a large number of header entries and repeatedly builds the FP-Tree, which leads to serious memory consumption 13 . The Eclat algorithm is a vertical-format mining algorithm that uses depth-first search for frequent pattern mining without repeatedly traversing the data 17 , but when the data scale is large, solving for the support consumes a great deal of memory and reduces mining efficiency 18 . In addition, the classic association rule mining algorithms 19 require manual setting of support and confidence, which is subjective and affects the effect of data mining [20][21][22] . One of the key factors in improving mining efficiency is to effectively filter the data according to the distribution of the data set 23 . Second, in most cases users are only interested in a certain part of the relationships between samples, but Apriori, FP-Growth and Eclat all try to mine every association rule in the sample set 24 , which reduces the accuracy of the mining algorithm. The quality of an association rule mining algorithm is closely related to its mining efficiency, memory consumption and algorithm complexity 25 .
However, according to the experimental results of this paper, when the sample set dimension exceeds 15, the mining efficiency of the above three algorithms drops significantly, their memory consumption increases sharply, and the number of association rules mined becomes very large. In response to these problems, this paper proposes the Mining Multidimensional Association Rules by Combination (Marc) algorithm, which is suitable for high-dimensional data. The core idea of the algorithm is to introduce the concept of the overall data distribution, calculate the distribution coefficient and deletion threshold, add a data screening step, build a sample relationship table, and mine the data in combination with the minimum support and minimum confidence set by the user. Experimental analysis shows that the algorithm is better than the Apriori, FP-Growth and Eclat algorithms in mining efficiency, memory consumption, number of generated association rules and number of generated frequent items, and its results are more in line with data expectations.

Principle of Marc algorithm
The Marc algorithm mainly solves the problem of high-dimensional data mining. When dealing with samples, it is mainly based on the influence of the distribution of samples on frequent items and association rules and the premise that the expansion of full permutation of samples does not change the actual frequent items and association rules. Based on the above premise, the innovation of the Marc algorithm has the following two aspects.
(1) Secondary filtering during data reading. The Apriori, FP-Growth and Eclat algorithms filter the data based only on the minimum support set by the user during the initial read, which is subjective. The Marc algorithm introduces two indicators: the sample set distribution coefficient and the deletion threshold. When the data are read for the first time, a secondary screening of the initial samples is added according to the distribution of the samples: the occurrence frequency of a retained sample must exceed both the minimum support and the deletion threshold. This step helps reduce the number of weakly frequent items and weak associations, improving the efficiency and focus of mining.
(2) Reconstruction of sample combinations based on a sample relation table The core of the Marc algorithm mining frequent items and association relations is to construct the relationship table of samples. Then, according to the relationship table, the samples in the sample set are fully arranged and combined, the dimension of the sample set is expanded, the association relations between samples are transformed into full relationship combinations, and then the combined samples are mined. Compared with FP-Tree organization, the all-relational approach takes up less memory, has lower processing complexity, and is more convenient for calculating support and confidence in high-dimensional data mining, which is more conducive to the mining of massive high-dimensional data.

Algorithm index
(1) Support: the percentage of all transactions that contain both A and B.
(2) Confidence: the proportion of transactions containing A that also contain B, i.e., the ratio of the support of {A, B} to the support of {A}.
(3) Distribution coefficient: used to calibrate the dispersion of the samples. In the Marc algorithm, it is obtained by mapping the variance or standard deviation of the sample frequencies through the Sigmoid function.
The input k is the variance or standard deviation of the sample frequencies; which of the two is used is determined according to actual needs.
(4) Deletion (trim) threshold: the product of the distribution coefficient and the sample mean E. It is mainly used to remove weakly frequent items from the sample set.
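The paper does not give closed-form expressions for these indicators, so the following is a minimal sketch of how they could be computed, assuming the standard logistic (Sigmoid) function for the distribution coefficient; the function names are illustrative, not taken from the original algorithm.

```python
import math

def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """Proportion of transactions containing A that also contain B:
    support(A ∪ B) / support(A)."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

def distribution_coefficient(k):
    """Sigmoid mapping of the dispersion measure k (variance or std dev)
    into (0, 1); since k >= 0, the result lies in [0.5, 1)."""
    return 1.0 / (1.0 + math.exp(-k))

def trim_threshold(freqs, use_variance=True):
    """Deletion threshold = distribution coefficient x sample mean E."""
    n = len(freqs)
    mean = sum(freqs) / n
    var = sum((f - mean) ** 2 for f in freqs) / n
    k = var if use_variance else math.sqrt(var)
    return distribution_coefficient(k) * mean
```

A sample is then retained only if its frequency exceeds both the minimum support and this trim threshold.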

Procedure
The core idea of the Marc algorithm is to consider the overall distribution of the data when screening it, build a relationship table between samples, recombine the sample data based on the relationship table, and mine frequent items and association relationships. The mining process of the Marc algorithm is shown in Fig. 1.
(1) According to the minimum support and the calculated deletion threshold, the original data set is filtered and arranged in descending order of sample frequency.
(2) Scan the sorted data set and record each single sample and its number of occurrences to form a sample frequency set. Then, the samples in the sample frequency set are combined in pairs, the number of occurrences of each sample combination is counted, and the combinations whose counts are below the minimum support are deleted to form the sample combination frequency set.
(3) Generation of sample relationship tables based on sample combination frequency sets.
(4) Scan the sorted data set generated in (1) and rearrange the samples according to the sample relation table to form a new sample data set.
(5) Scan the dataset generated in step (4), count the frequency of combined samples, calculate the support and confidence of combined samples, and generate frequent items and frequent patterns.
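Steps (1)-(3) above can be sketched as follows. This is not the authors' implementation: `build_relation_table` is a hypothetical helper, and the deletion (trim) threshold is assumed to be passed in precomputed.

```python
from collections import Counter
from itertools import combinations

def build_relation_table(rows, min_support, trim_thresh):
    # (1) count single-sample frequencies and double-screen: a sample is
    #     kept only if its frequency passes BOTH thresholds
    freq = Counter(x for row in rows for x in row)
    kept = {x: f for x, f in freq.items()
            if f >= min_support and f >= trim_thresh}
    # sort the surviving samples by descending frequency
    order = sorted(kept, key=kept.get, reverse=True)
    # (2) count pairwise co-occurrences, pruning pairs below min_support
    pair_freq = Counter()
    for row in rows:
        present = [x for x in order if x in row]
        for a, b in combinations(present, 2):
            pair_freq[(a, b)] += 1
    pair_freq = {p: f for p, f in pair_freq.items() if f >= min_support}
    # (3) relation table: each sample -> the set of samples it co-occurs with
    table = {}
    for a, b in pair_freq:
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return order, pair_freq, table
```

The returned table is then used in steps (4)-(5) to rearrange each row before counting combined samples.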
The detailed steps of the Marc algorithm are as follows:
(1) Set the minimum support to min_support and the minimum confidence to min_confidence. Let D denote the sample set with n rows and m dimensions,
where m is the largest dimension and each row contains at most m samples.
(2) Traverse the sample set D and count the number of occurrences of each sample; the total number of samples is N.
As shown in Fig. 2, samples whose frequency is less than the minimum support are deleted, and the remaining samples are sorted in descending order of frequency.

Fig. 2 sample frequency statistics and sorting
(3) Calculate the mean E, variance σ² and standard deviation σ of the sample frequencies.
(4) Map the standard deviation or variance through the Sigmoid function to obtain the distribution coefficient.
Since the variance and standard deviation reflect the degree of sample dispersion, they can be used as indicators for screening the original samples. The purpose of the Sigmoid mapping is to scale this measure of dispersion into [0,1], which is more convenient when solving for the deletion threshold. The specific quantity used to calculate the distribution coefficient can be chosen according to the actual situation. It should be noted that the Sigmoid function saturates over a small range of inputs: as shown in Fig. 3, the result is very close to 0 when the input is less than -6 and very close to 1 when it is greater than 6. Therefore, to better filter the samples, it is recommended to select the variance when the sample variance is large and the standard deviation when the sample variance is small.
Here x_i denotes the i-th sample and f_i the frequency of the i-th sample.
(7) The samples in the set are combined in pairs to form a new combination set. Count the frequency of each combined sample in this set and delete the combinations whose frequency is less than the minimum support.
Here (x_i, x_j) denotes the combination of the i-th and j-th samples and f_ij its frequency.
(8) Based on the combination frequency set, generate the sample relationship table.
The specific generation method is, for each sample appearing in a combination, to find all samples related to it that exist in the combination set.
(9)-(11) Scan the sorted data set and rearrange each row according to the sample relationship table to form a set of combined samples; count the frequency of each combined sample, calculate its support and confidence, and delete the combinations whose support is less than the minimum support or whose confidence is less than the minimum confidence.
(12) Generate frequent item sets and association relationship sets, arranged in descending order of support and confidence.
The process of mining frequent rules by the Marc algorithm is shown in Fig. 4.
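The recombination and mining stage, scanning the rearranged rows and solving support and confidence over the combined samples, might look like the following sketch. `mine_rules`, the rule tuple format, and the use of pairwise relatedness as the combination test are illustrative assumptions, not the authors' code.

```python
from collections import Counter
from itertools import combinations

def mine_rules(rows, table, min_support, min_confidence):
    """Rearrange each row via the relation table `table` (sample -> set of
    related samples), count frequent combinations, and derive rules.
    Note: enumerating all combinations is exponential in the row length;
    this sketch is only intended for small examples."""
    combo_freq = Counter()
    for row in rows:
        # keep only samples that appear in the relation table
        related = [x for x in row if x in table]
        for r in range(1, len(related) + 1):
            for combo in combinations(sorted(related), r):
                # a combination counts only if every pair in it is related
                if all(b in table[a] for a, b in combinations(combo, 2)):
                    combo_freq[combo] += 1
    frequent = {c: f for c, f in combo_freq.items() if f >= min_support}
    rules = []
    for combo, f in frequent.items():
        if len(combo) < 2:
            continue
        for r in range(1, len(combo)):
            for lhs in combinations(combo, r):
                conf = f / combo_freq[lhs]  # support(combo) / support(lhs)
                if conf >= min_confidence:
                    rhs = tuple(x for x in combo if x not in lhs)
                    rules.append((lhs, rhs, conf))
    return frequent, rules
```

Combinations and rules below the minimum support or minimum confidence are discarded, matching steps (9)-(12).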

Experimental analysis
To facilitate data mining, the support ratio is replaced by the absolute support count. The minimum support set in this experiment is 5, and the minimum confidence is 0.5. The Python version of the experimental environment is 3.8.5. The experimental equipment configuration is shown in Table 1.

Experiment procedure
(1) Build a sample set. Construct 75 groups of test sample sets; the maximum number of rows in each sample set is 300, and the highest dimension of each row is 25. The number of rows starts from 20, with each increase of 20 rows forming a group, and the dimension starts from 10, with each increase of 5 dimensions forming a group. The specific generation method is as follows: first generate 300 rows of 10- to 25-dimensional samples, and then divide the samples into groups according to the number of rows.
To focus the experimental samples, their generation follows these rules: ① assuming the samples are subscripted, odd-subscripted items are related only to odd-subscripted items and even-subscripted items only to even-subscripted items, and the sample set is generated by alternating odd and even rows; ② the maximum number of dimensions of the sample set is fixed, and a row of records contains at most that many samples; ③ each row contains a number of noise samples, randomly set between 0 and 3; the selection rule for noise samples is that when generating an odd row the noise samples are even samples, and when generating an even row the noise samples are odd samples.
④ The samples within the same row are not repeated.
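Under the generation rules above, a test sample set could be produced roughly as follows. Pool sizes, item names and the sampling scheme are assumptions not fixed by the paper.

```python
import random

def generate_rows(n_rows, max_dim, n_items=25):
    """Sketch of the test-set generator: odd rows draw from odd-subscripted
    samples and even rows from even-subscripted ones (rule ①), each row has
    at most max_dim samples (rule ②) including 0-3 noise samples of the
    opposite parity (rule ③), with no repeats within a row (rule ④)."""
    odd = [f"X{i}" for i in range(1, n_items + 1, 2)]
    even = [f"X{i}" for i in range(2, n_items + 1, 2)]
    rows = []
    for r in range(n_rows):
        main, other = (odd, even) if r % 2 == 0 else (even, odd)
        noise = random.randint(0, 3)
        k = random.randint(1, max_dim - noise)  # main samples in this row
        row = random.sample(main, min(k, len(main)))
        row += random.sample(other, min(noise, len(other)))  # noise samples
        rows.append(tuple(row))
    return rows
```

Because the main and noise samples come from disjoint parity pools and `random.sample` draws without replacement, rule ④ (no repeats within a row) holds by construction.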
(2) Marc, Apriori, FP-Growth and Eclat were used to mine the above test data, and the mining time, memory consumption, number of frequent items and number of association rules generated were recorded.
(3) Analyze the experimental results and analyze the results of (2).

Analysis
This paper selects four indicators, algorithm execution time, memory consumption, number of frequent items and number of association relationships, for a comparative analysis of Marc, Apriori, FP-Growth and Eclat.
(1) Time
The mining times of the four algorithms in different dimensions are compared as follows:

Fig. 5 Comparison of mining time
Although the dataset generation conditions are fixed, the data still have a certain degree of randomness, so some test results show negatively correlated characteristics. Nevertheless, Fig. 5 shows that the mining time of the four algorithms is generally positively correlated with the number of rows and the dimension. The mining time of the Marc algorithm is slightly higher than that of FP-Growth when the number of data rows is small and the dimensionality is low (as in Fig. 5a), because the Marc algorithm needs to generate the sample relationship table and the full relationship combinations of samples. Overall, however, the Marc algorithm has a lower running time than the other three algorithms. Among those three, when the data dimension is 10, the time consumption of the Apriori algorithm is higher than that of FP-Growth and Eclat (as shown in Fig. 5a), but as the data dimension increases, the execution time of the Apriori algorithm becomes lower than that of FP-Growth and Eclat. In terms of the growth rate of mining time, the figure shows that the Marc and Apriori algorithms grow more stably than FP-Growth and Eclat, whereas the mining time of FP-Growth and Eclat increases markedly with the data dimension; for sample sets with the same number of rows, their mining time grows exponentially with the dimension. This comparative analysis shows that the Marc algorithm has higher mining efficiency than the other three algorithms, and the higher the data dimension and number of rows, the more obvious the efficiency improvement. The FP-Growth and Eclat algorithms are suitable for mining low-dimensional data: their mining efficiency is lower than that of Apriori when the data dimension exceeds 10, and Eclat's mining efficiency is more volatile when the data dimension is 25.
(2) Memory
The memory consumption of the four algorithms is compared as follows:

Fig. 6 Comparison of memory consumption

As shown in Fig. 6, the memory consumed during Marc mining is generally smaller than that of the Apriori, FP-Growth and Eclat algorithms, and the higher the sample dimension, the more obvious the memory advantage of the Marc algorithm. For low-dimensional data mining, Marc consumes more memory because it generates the sample relationship table and full relationship combinations, but for high-dimensional data mining, the data organization formats of the other three algorithms consume far more memory than Marc's relationship table and full relationship combinations. In terms of the growth rate of memory consumption, the Marc algorithm grows much more slowly than the other three algorithms, and its growth rate tends to stabilize with increasing dimensionality and number of rows. Therefore, the resource consumption of the Marc algorithm in high-dimensional data mining is much lower than that of the other three algorithms, and its stability is higher.

(3) Frequent items
The comparison of the number of frequent items generated by the four algorithms is as follows:

Fig. 7 Comparison of the number of frequent items

Fig. 7 shows that the number of frequent items mined by the four algorithms tends to stabilize as the number of sample rows increases. The numbers of frequent items mined by the Apriori, FP-Growth and Eclat algorithms are comparable, whereas the number mined by the Marc algorithm is much lower. At the same dimension, the number of frequent items mined by the Marc algorithm gradually stabilizes as the number of rows increases, which is more in line with the actual situation of the data.
(4) Association rules
A comparison of the number of association relationships generated by the four algorithms is shown below:

Fig. 8 Comparison of the number of associated rules
It can be seen from Fig. 8 that the number of association relationships mined by the Marc algorithm is smaller than that of the other three algorithms, and the higher the dimension, the more obvious the gap. As the number of rows and the dimension increase, the number of association relationships generated by the other three algorithms increases sharply, whereas the number generated by the Marc algorithm remains relatively stable; the larger the number of rows and the dimension, the higher the stability of the Marc algorithm. The Marc algorithm thus significantly reduces the number of association rules generated, and the more samples there are, the more stable the mined association relationships, which better matches the characteristics of the samples.

Accuracy
To facilitate the calculation, the verification sample set used to verify the mining accuracy is a 10-row, 5-dimensional full-relationship sample set, in which each row consists of the five samples X1 to X5. The minimum support set in the verification experiment is 10, and the minimum confidence is 0.5. Since the set consists of full relationships, the support and confidence of all relationships are 1. The actual number of frequent items in the verification sample is 31, and the actual number of association relationships is 129.
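The expected count of 31 frequent items for a 5-sample full-relationship set can be checked by enumerating the non-empty subsets of {X1, ..., X5}:

```python
from itertools import combinations

# In a full-relationship set, every non-empty subset of the five samples
# is frequent, so the expected count is 2^5 - 1 = 31.
items = ["X1", "X2", "X3", "X4", "X5"]
n_frequent = sum(1 for r in range(1, len(items) + 1)
                 for _ in combinations(items, r))
```

This matches the 31 actual frequent items quoted above. (The count of association relationships depends on the rule-counting convention and is not reproduced here.)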
The frequent items and association rules mined by the Marc algorithm are shown in Table 2 and Table 3. From these tables, it can be concluded that the accuracy of the Marc algorithm in mining frequent items is 100%, and the accuracy of the mined association relationships is also 100%.

Conclusion
The Marc association rule mining algorithm is mainly aimed at high-dimensional data mining scenarios. After experimental validation and comparative analysis, the following conclusions are drawn: (1) By introducing the distribution coefficient and deletion threshold when the data are first filtered, Marc reduces the time and space complexity of the mining task from the outset, greatly improving mining efficiency and saving memory space.
(2) Marc builds the whole relationship combination of samples based on the sample relationship table, which simplifies the complexity of maintenance in the mining process and improves the efficiency of mining frequent items and related relationships.
(3) The Marc algorithm is better than the Apriori, FP-Growth and Eclat algorithms in terms of mining efficiency and memory consumption, and the generated frequent items and associations are closer to the sample and application reality. The higher the dimension, the more obvious the advantage, making it more suitable for massive, high-dimensional data mining.
(4) The accuracy of the frequent items and association rules mined by the Marc algorithm is 100%.
(5) The mining efficiency of the FP-Growth and Eclat algorithms is lower than that of Apriori when mining high-dimensional data.