Trie-based Output Space Itemset Sampling

Pattern sampling algorithms produce interesting patterns with a probability proportional to a given utility measure. When patterns are sampled from large databases, a change of utility must be followed by fast reprocessing. Moreover, existing sampling techniques require storing all data in memory, which is costly. To tackle these issues, this work enriches D. Knuth’s trie structure, avoiding 1) the need to access the database to sample, since patterns are drawn directly from the enriched trie, and 2) the need to reprocess the whole dataset when the utility changes. We define the trie of occurrences, which our first algorithm, TPSpace (Trie-based Pattern Space), uses to materialize all of the database patterns. Factorizing transaction prefixes compresses the transactional database. TPSampling (Trie-based Pattern Sampling), our second algorithm, draws patterns from a trie of occurrences under a length-based utility measure. Experiments show that TPSampling produces thousands of patterns in seconds.


I. Introduction
Pattern mining [1] is an active research field that aims at discovering interesting and non-trivial information in large databases. Methods for discovering relevant patterns in a transactional database are known as itemset mining methods. During the last decade, researchers in this field have addressed the pattern explosion problem, which is caused by the combination of the volume of data and the combinatorial nature of mining methods. In fact, controlling the size of the set of frequent patterns given a minimum threshold, for example, is extremely difficult. On the one hand, if the minimum threshold is very low, the end user is overwhelmed by the number of returned patterns. On the other hand, if the minimum threshold is very high, the set of patterns may be empty. Many approaches have been proposed to solve this problem, such as Top-k pattern mining [2], which returns the k most frequent patterns but lacks diversity. The most recent approach is based on output pattern sampling [3]. Output pattern sampling is the process of drawing a sample of patterns from the set of all patterns of the dataset. It is a non-exhaustive method for discovering relevant patterns that provides strong statistical guarantees thanks to its random nature. Pattern sampling has been shown to be an important component of interactive pattern-based mining systems by providing anytime methods [4] and by integrating user feedback [5], [6].
With state-of-the-art sampling methods, the entire database must be stored in memory, except in [7], where the data are natively distributed over different sites and the algorithm is executed on a single machine. In our case, we assume that all transactions are on a single machine where the user can run a sampling algorithm. For large databases, a compact data structure can be used to compress the database, as was done by Han et al. [8] for exhaustive frequent pattern mining. No output pattern sampling method in the literature [3], [9], [10], [11] has been applied to a compact database representation. Besides, when a user changes the utility measure, reprocessing must be cheap: an efficient method should not have to reweight every transaction, since large databases slow down this phase. Diop et al. [7] address changing utilities such as frequency, area, or decay. However, the cost of their solution depends on the number of transactions in the database. This paper discusses ways to speed up pattern sampling in large databases when the utility measure changes. We sample patterns directly from a compressed database. Our main goal is therefore to design a generic output pattern sampling algorithm that draws patterns from a trie of occurrences according to a length-based utility measure [7].
Our main contributions are as follows.
• We introduce a new structure called the trie of occurrences and propose TPSpace (Trie-based Pattern Space), its construction algorithm. Each node in the trie carries the weight information needed to draw a pattern. In our case, we weight each node based on the number of occurrences in the sub-trie of which it is the root.
• We propose TPSampling (Trie-based Pattern Sampling), a generic algorithm for sampling patterns from a trie of occurrences according to a length-based utility measure. TPSampling is generic because it supports any length-based utility measure.
The remainder of this paper is structured as follows. Section II situates our work in the state-of-the-art of pattern sampling and Section III highlights the challenges that must be overcome in order to achieve our goal. Our main contributions are covered in Section IV and Section V. The theoretical analyses are presented in Section VI, the experimental results in Section VII, and Section VIII concludes the paper.

II. Related works
This section presents the related works in pattern sampling and data structures for pattern mining.

A. Pattern sampling techniques
Since the first pattern sampling method was proposed in 2009 [3], numerous algorithms have been proposed for output pattern sampling [9], [10], [12], [11], [13]. Its usefulness has been widely demonstrated in many areas in recent years, including feature classification [9], outlier detection [4], and interactive discovery [5], [6]. It has also been applied to a variety of structured data formats, including graphs [3], itemsets [9], and sequences [11]. To circumvent the long tail issue [9], Diop et al. [11] weight each pattern with a norm-based utility. In this paper, we examine the state-of-the-art methods with respect to their memory efficiency and their flexibility under utility change.
Efficiency in memory storage: It is worth noting that all of these algorithms operate locally on a single machine. As a result, they store the whole data set in the available RAM, which can be a challenge with huge databases.
Flexibility on utility change: When the user changes the utility measure, the multi-step approaches perform a reprocessing phase that involves weighting each database transaction according to the new utility measure. Diop et al. [7] demonstrate that the reprocessing phase is experimentally fast since it is proportional to the number of transactions in the database. This argument is no longer valid in large databases with many transactions.

B. Data structures for pattern mining
Many data structures have been proposed to solve the problem of pattern mining, such as the "FP-Tree" [8] or the "Trie" [14]. Because multiple transactions can contain the same information, there are many repetitions in transactional databases. These repetitions are meaningful in this field because they allow us to discover interesting rules; however, we must first decide how to represent them. In a transactional database, for example, if 90% of the transactions that contain the items {e1, e2, e3, e4} also contain the items {e5, e6}, we might as well group them together so that they share the same prefix. This significantly reduces the size of the database in memory. With an "FP-Tree" or a "Trie", transactions that share the same prefix are represented along a single path. The difference between these two structures is that, unlike the "Trie", which only connects a node to its children, the "FP-Tree" also connects nodes from different branches in order to compute the frequency of the patterns quickly. We propose using the "Trie" [14] to obtain a compact representation of the database in memory, because we do not need to compute these frequencies. Tries have also been used in Big Data for distributed graph computing [15].
In this paper, we propose an original multi-step pattern sampling method, the first approach based on a compact structure. We will see that the compact structure not only addresses memory issues, but also facilitates reprocessing when the user changes the utility measure (utility change flexibility).

III. Preliminaries and problem statement
In this section, we first present some basic concepts and definitions that are required for understanding the subject. We conclude with a formalization of the problem addressed in this paper.

A. Basic definitions
Let $I = \{e_1, \cdots, e_N\}$ be a finite set of literals called items. We assume that an arbitrary total order $>_I$ exists between the items: $e_1 >_I \cdots >_I e_N$. An itemset (or pattern), denoted by $ϕ = \{e_{i_1}, \cdots, e_{i_n}\}$ (or simply $ϕ = e_{i_1} \cdots e_{i_n}$), with $n ≤ N$, is a non-empty subset of I, $ϕ ⊆ I$. The set of all patterns that we can generate from I is called the pattern language, denoted by $L = 2^I \setminus \{∅\}$. The length of a pattern ϕ ∈ L, denoted by |ϕ|, is the number of items it contains (its cardinality). A transactional database D is a multi-set of itemsets (called transactions) where each of them has a unique identifier j ∈ N. We denote by $t_j = e_1 \cdots e_n$ a transaction identified by j of length $|t_j| = n$. Table I is a transactional database made up of four items, I = {A, B, C, D}. In the rest of this paper, we use this database to give some illustrations.
Originally, the goal of a pattern sampling technique is to access the pattern space L(D) by an efficient sampling procedure simulating a distribution $P : L(D) → [0, 1]$ which is defined with respect to some utility measure m: $P(·) = m(·)/Z$, where Z is a normalization constant, the sum of the utilities of all patterns ϕ ∈ L(D), defined by $Z = \sum_{ϕ ∈ L(D)} m(ϕ, D)$. The selection of k patterns of L(D) according to a distribution proportional to a utility measure m may be expressed as follows: $Sample_k(L, D, m) = \{ϕ_1, \cdots, ϕ_k\}$ with $ϕ_i ∼ m(L, D)$, where ϕ ∼ m(L, D) means that ϕ is drawn with a probability proportional to m. Formally, $ϕ ∼ m(L, D) ⇔ P(ϕ, L(D)) = m(ϕ, D)/Z$. In other words, the main objective of output sampling methods is to get a sample of patterns that is representative of the set of patterns that can be extracted from the database. For example, if a pattern $ϕ_1$ has a utility twice as high as that of a pattern $ϕ_2$ according to the utility measure chosen by the user, then $ϕ_1$ should be twice as likely to be in the sample as $ϕ_2$. Frequency is the most common utility measure.
Definition 1 (Frequency of a pattern): Let D be a database and ϕ be a pattern of L(D). The frequency of ϕ in D is defined as follows: freq(ϕ, D) = |{t i ∈ D : ϕ ⊆ t i }|.
By definition, the operator Sample k with k > 0 is not deterministic if L(D) has at least two patterns with probabilities that are not null. In other words, two draws with the same utility measure in the same database may not return the same k patterns.
It is also common to give a utility to an itemset and to combine the frequency of an itemset with its utility. In this paper, we deal with length-based utility measures [16].
Definition 2 (Length-based utility measures [16]): A utility u defined from L(D) to R is called a length-based utility if there exists a function $f_u$ from N to R such that $u(ϕ) = f_u(|ϕ|)$ for each ϕ ∈ L(D). Given the set U of length-based utilities, M is the set of utility measures $m_u$ such that for every pattern ϕ and database D, $m_u(ϕ, D) = freq(ϕ, D) × u(ϕ)$ with u ∈ U. For example, with the frequency, the utility function is $u_{freq}(ϕ) = 1$ for all patterns in L(D). If we consider the utility function $u_{area}(ϕ) = |ϕ|$, we obtain the area measure: freq(ϕ, D) × |ϕ|. If $u_{decay}(ϕ) = α^{|ϕ|}$, we get an exponential decay with α ∈ ]0, 1]. More generally, we consider the class of utility measures of the form freq(ϕ, D) × u(ϕ) where u depends exclusively on the length of the itemset.
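As an illustration of Definition 2, the following minimal Python sketch expresses the three utilities named above as functions of the length only, and combines them with the frequency of Definition 1. The function names and the four-transaction toy database are assumptions made for this example; they are not taken from the authors' implementation.

```python
# Length-based utilities: each one is a function f_u of the pattern length only.
def f_freq(length):
    return 1.0                      # frequency: every length has utility 1

def f_area(length):
    return float(length)            # area: utility equals the length

def make_f_decay(alpha):
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in ]0, 1]")
    return lambda length: alpha ** length   # exponential decay

def freq(pattern, database):
    """Number of transactions that contain every item of the pattern."""
    items = set(pattern)
    return sum(1 for t in database if items <= set(t))

def measure(pattern, database, f_u):
    """Utility measure of the form freq(pattern, D) * f_u(|pattern|)."""
    return freq(pattern, database) * f_u(len(pattern))

# Hypothetical toy database: one string of items per transaction.
toy_db = ["AB", "AC", "BC", "ABCD"]
print(measure("AB", toy_db, f_freq))             # frequency of AB
print(measure("AB", toy_db, f_area))             # area of AB
print(measure("AB", toy_db, make_f_decay(0.5)))  # decayed utility of AB
```

Because the utility only enters through `f_u`, swapping one measure for another does not require touching the frequency computation.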

B. Key ideas, challenges and problem statements
We first present the key idea and the challenges that situate our work, before formulating the questions that must be answered to achieve our goal.
a) Key idea: As we have pointed out, drawing a pattern is one of the most important steps in pattern sampling, especially in the case of user-centered mining. In this paper, we suggest using a trie structure to build the pattern space while making sure that reprocessing after a utility change remains flexible.
Definition 3 (Trie [14]): A trie is a data structure in the form of a rooted tree such that, for any node, all its descendants share a common prefix.
Example 2: We represent the database D as a trie in Figure 1. To do this, each node in the trie has the set of transactions in the database that end with its label. It is important to note that many representations of the database D as a trie are possible depending on the insertion order of the items (decreasing order of their frequencies, ascending order, lexicographic order, etc.).
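The effect of the insertion order on the shape of the trie can be seen on a small case. The sketch below is a plain prefix trie built with nested dictionaries; it does not store the transaction sets mentioned in Example 2, and the toy database and function names are assumptions chosen for illustration.

```python
def build_trie(database, order):
    """Insert every transaction, with its items sorted by `order`, into a nested-dict trie."""
    position = {item: i for i, item in enumerate(order)}
    root = {}
    for transaction in database:
        node = root
        for item in sorted(transaction, key=position.__getitem__):
            node = node.setdefault(item, {})   # reuse the shared prefix when it exists
    return root

def count_nodes(trie):
    """Number of nodes below the root."""
    return sum(1 + count_nodes(child) for child in trie.values())

# Hypothetical toy database over the items A, B, C, D.
toy_db = ["AB", "AC", "BC", "ABCD"]
lexicographic = build_trie(toy_db, "ABCD")
reversed_order = build_trie(toy_db, "DCBA")
print(count_nodes(lexicographic), count_nodes(reversed_order))
```

On this toy input the two orders give tries of different sizes, which is why the choice of the total order matters for memory consumption.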
b) Challenges with trie-based pattern sampling: Sampling a pattern from a trie structure is difficult. Aside from ensuring that a pattern is drawn exactly with a probability proportional to its weight, we must also accommodate new length-based utilities without time-consuming reprocessing. The primary issues that must be addressed are the following:
• efficiently build and weight the trie that corresponds to the database,
• draw a pattern directly from the trie proportionally to its utility in the database.
c) Problem statement: These challenges can be addressed by answering the following questions. Let D be a database, u, u′ ∈ U two length-based utilities, and µ and M two integers such that 0 < µ ≤ M.
1) What should be done to a (classical) trie to allow direct pattern sampling without the underlying transactional database?
2) How to draw a pattern ϕ from L  Table II.

IV. TPSpace: Trie-based Pattern Space
Table II (notation)
$L_{[µ..M]}(D)$ : Set of patterns of D with lengths between the length constraints µ and M
u : Length-based utility that belongs to the set of length-based utilities U
$m_u(ϕ, D)$ : The utility measure of the pattern ϕ in D that combines frequency and utility u
$f_u$ : Utility function defined from N to $R^+$ such that $u(ϕ) = f_u(|ϕ|)$
$L^o_{[µ..M]}(D)$ : Set of occurrences of D with lengths between the length constraints µ and M
η : Node of the trie T
P : Identifier of a node; it is also a prefix
P (label) : Label of the node η identified by P, P = η.label
$T_P$ : Sub-trie of the trie T whose root is the node identified by P
$φ^+_ℓ(P, D)$ (resp. $φ^-_ℓ(P, D)$) : Set of occurrences of length ℓ with (resp. without) the item P in the sub-trie
$Φ_ℓ(∅, D)$ : Cardinality of the set of occurrences of length ℓ in the trie T
$rank_ℓ(ϕ_j, T)$ : Rank of the occurrence $ϕ_j$ of length ℓ in all the set of occurrences in $L^o$

To define our new data structure, we need to introduce the notion of occurrence: $L^o$ is a set of patterns, and, unlike a pattern, an occurrence belongs to one and only one transaction. The frequency of a pattern in L(D) is the cardinality of the set of its occurrences in $L^o(D)$. We have then

Example 3: To draw a pattern of length 2 proportionally to its frequency in D, it suffices to uniformly draw an occurrence in the set $L^o_{[2..2]}(D)$. As a result, the probability of drawing the pattern AB in the set S is equal to P(AB, . Using a uniform drawing of occurrences, we may draw a pattern of length ℓ based on its frequency among patterns of the same length. If we can draw a length ℓ proportionally to the sum of the utilities of the patterns of length ℓ, we can then pick a pattern from the database based on its utility.
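The link between a uniform draw of occurrences and a frequency-proportional draw of patterns can be checked by brute force. The sketch below enumerates occurrences as (transaction identifier, pattern) pairs; the toy database is an assumption chosen to agree with the transactions and occurrences quoted in the examples of this paper (for instance $t_3$ = BC and $t_4$ = ABCD), since Table I itself is not reproduced here.

```python
from collections import Counter
from itertools import combinations

def occurrences(database, length):
    """All (transaction id, pattern) pairs where the pattern has the given length."""
    return [
        (j, "".join(items))
        for j, transaction in enumerate(database, start=1)
        for items in combinations(sorted(transaction), length)
    ]

# Assumed toy database: t1 = AB, t2 = AC, t3 = BC, t4 = ABCD.
toy_db = ["AB", "AC", "BC", "ABCD"]

occ2 = occurrences(toy_db, 2)
counts = Counter(pattern for _, pattern in occ2)

# Drawing one occurrence uniformly returns a pattern with probability
# (number of its occurrences) / (total number of occurrences of that length),
# and the number of occurrences of a pattern is exactly its frequency.
prob_ab = counts["AB"] / len(occ2)
print(len(occ2), counts["AB"], prob_ab)
```

Here AB has two occurrences of length 2 (one in $t_1$, one in $t_4$), so a uniform draw over the occurrences of length 2 returns AB in proportion to its frequency.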

A. Definition of a trie of occurrences
Since many representations are possible according to the order insertion, we start by defining the total order relations used in this paper before formalizing the identifier and the content of a node.
Definition 5 (Total order relation between items): Let I be a set of items or literals on which the transactional database D is defined. We are now going to define the notion of node identifier in a trie. It is a concept that will allow us to enrich the trie of occurrences from a transactional database.
Definition 6 (Node identifier): Given a set of items I = {e 1 , · · · , e n } and a symbol ∈ I, a trie T defined on I is a tree where every node η ∈ T except the root contains a label denoted by η.label ∈ I, and where the root of the trie contains the label . Thus, any node η ∈ T can be identified by the sequence of node labels on the path from the root of T to the node η. If e i1 . . . e i k is this sequence, we write it down more simply as P = e i1 . . . e i k , and we denote by P = e i k the label of the identified node η, P = η.label. For the root, we set that P = ∅.
In the following, we will often use the concept of a sub-trie of a trie, defined below.
Definition 7 (Sub-trie): Let T be a trie and P the identifier of a node. We denote by T P the sub-trie of T whose root is the node identified by P .
We now define the concatenation operator • as follows. Definition 8 (Concatenation operator •): Let ϕ and ϕ be two itemsets defined in I and ordered according to If > I is the lexicographic order, then we have B • AC = BC and A • BC = ABC. With this concatenation operator, we can define a prefix that will be used in the definition of a truncated database.
Definition 9 (Prefix): Let t be a transaction defined in I and P a sequence of items ordered according to > I . P is a prefix of the transaction t if there is an itemset ϕ ⊆ I such that t = P • ϕ. In the database D in Fig. 1, P = AB is a prefix of the transaction t 4 = ABCD but P is not a prefix of the transaction t 3 = BC. According to Definition 9, transactions with a common prefix can be grouped.
To determine which sub-trie a pattern occurs in, we introduce the concept of a truncated database.
Definition 10 (Truncated database): Let $>_I$ be a total order relation on all items, D be a transactional database, and P be a node identifier. A truncated database of D on P, denoted by $D_P$, is a transactional database that holds a copy of any transaction t of D with prefix P, minus the items of t that appear before P.
Example 5: Let us consider the trie of the transactional database D in Fig. 1. Occurrences in the truncated database $D_{AB}$ are the occurrences stored in the sub-trie whose root is identified by the prefix AB: $B_1, B_4, BC_4, BD_4, CD_4, BCD_4$. Likewise, the occurrence stored in the sub-trie whose root is identified by the prefix AC, namely $C_2$, is the occurrence of the truncated database $D_{AC}$. Now, by splitting these groups of occurrences by length, we will identify which of them are represented at the top level of the truncated database of D on the prefix P.
Definition 11 (Computing weights $Φ^-$ and $Φ^+$): Let D be a transactional database and P a prefix. The set of occurrences of length ℓ of the truncated database on the prefix P is defined by: $φ_ℓ(P, D) = \{(i, ϕ) ∈ N × L(D) : (i, ϕ′) ∈ D_P ∧ ϕ ⊆ ϕ′ ∧ |ϕ| = ℓ\}$. $Φ_ℓ(P, D)$ denotes the total number of occurrences of length ℓ in the truncated database $D_P$: $Φ_ℓ(P, D) = |φ_ℓ(P, D)|$. The set of occurrences $φ_ℓ(P, D)$ can be split into two parts:
• The set $φ^+_ℓ(P, D)$ of occurrences of length ℓ of the database truncated on the prefix P that contain the item labelling the node identified by P. Its cardinality is denoted by $Φ^+_ℓ(P, D) = |φ^+_ℓ(P, D)|$.
• The set $φ^-_ℓ(P, D)$ of occurrences of length ℓ of the database truncated on the prefix P that do not contain this item. Its cardinality is denoted by $Φ^-_ℓ(P, D) = |φ^-_ℓ(P, D)|$.
for ℓ ∈ [µ..M] with P the identifier of the node η in T, and µ and M the length constraints.
Example 7: Let us consider the transactional database of Fig. 1 and the minimum µ = 1 and maximum M = 3 length constraints. We build the trie of occurrences according to the total order relation $>_I$ in Fig. 2.
In this example, the set of labels for the children of the root is {A, B}. The number of occurrences of length ℓ = 2 in the trie T is equal to $T.Φ_2(∅, D) = 9$. Let η be the node identified by P = A. Then we have $η.Φ^+_2(P, D) = 5$ and $η.Φ^-_2(P, D) = 3$, meaning that the sub-trie $T_A$ contains 5 occurrences of length 2 with the item A: $AD_4, AC_4, AB_4, AB_1, AC_2$, and 3 occurrences of length 2 without the item A: $CD_4, BD_4, BC_4$.
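The two counts of Example 7 can be reproduced by brute force from Definitions 10 and 11. The sketch below is only meant to make the definitions concrete; the toy database is an assumption inferred from the occurrences listed in the examples, transactions are strings whose items are already sorted lexicographically, and the function names are hypothetical.

```python
from itertools import combinations

def truncated_database(database, prefix):
    """Transactions that start with `prefix`, cut so that they begin at its last item."""
    k = len(prefix)          # prefix must be non-empty in this sketch
    return [t[k - 1:] for t in database if t[:k] == prefix]

def phi_plus_minus(database, prefix, length):
    """Count the occurrences of the given length with / without the last item of `prefix`."""
    label = prefix[-1]
    plus = minus = 0
    for transaction in truncated_database(database, prefix):
        for occurrence in combinations(transaction, length):
            if label in occurrence:
                plus += 1
            else:
                minus += 1
    return plus, minus

# Assumed toy database: t1 = AB, t2 = AC, t3 = BC, t4 = ABCD.
toy_db = ["AB", "AC", "BC", "ABCD"]
print(truncated_database(toy_db, "AB"))
print(phi_plus_minus(toy_db, "A", 2))
```

For the prefix A and length 2, this enumeration yields the 5 occurrences with A and the 3 occurrences without A stated in Example 7.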

B. TPSpace: Algorithm for building a trie of occurrences
We describe how to generate a trie of occurrences from a transactional database to effectively handle length-based utility measures. First, note that the transactions are added to the trie iteratively. In our case, we must compute the positive and negative contributions of each transaction t to a node η identified by P, where P is a prefix of t.
Property 1: Let D = {t 1 , · · · , t n } be a transactional database. We denote by D i the subset of transactions defined by D i = {t k ∈ D : 1 ≤ k ≤ i}. If P is a prefix of t i , then we have: We need now to introduce some basic functions for creating, adding, or finding a node when inserting the items of a transaction into a trie.
• Let CreateNode be the function defined by η ← CreateNode(e), where η is a node such that η.label = e and η.child = ∅, which represents here an empty list of nodes.
• Let SearchChild be the function defined by η.child[i] ← SearchChild(e, η) if there is i such that η.child[i].label = e, and null otherwise.
• Let AddChild be the function that adds a child to a node. More precisely, if η is a node such that k = |η.child|, then after the execution of AddChild(c, η), we have |η.child| = k + 1 and η.child[k + 1] = c.
In the following, t[j], with j > 0, is the j-th item of the transaction t according to the total order relation $>_I$.

15: Compute $c.Φ^-_ℓ(P, D_i)$
16: η ← c
17: return T

Algorithm 1 describes the TPSpace method, which creates the trie of occurrences of an input database D according to a total order relation $>_I$. We initialize the trie of occurrences (line 3) by creating an empty node with the CreateNode function. For each transaction t of the input database, whose items follow the order relation $>_I$, we start at the root and, using Property 1, we compute and add its total contribution to the trie for each length (line 6). Then, for each item t[j] of the transaction being inserted in the trie, if there is no child node c labelled with the item t[j] according to the SearchChild function (line 9), we create it using the function CreateNode (line 11) and add it to the children of η with the function AddChild (line 12). Next, we add the positive and negative contributions of the transaction t to the node c (lines 13 to 15) using Property 1. We then move to node c (line 16) and the process starts again with the item at position j + 1 in t. Finally, line 17 returns the trie of occurrences T of the database D.
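The construction described for Algorithm 1 can be sketched in Python as follows. The exact update of Property 1 is not reproduced in this text, so the binomial increments used below are an assumption: a transaction with r items after the current node is taken to add C(r, ℓ − 1) occurrences of length ℓ with the node label and C(r, ℓ) without it. This choice reproduces the counts of Example 7 on the assumed toy database. The class and function names are hypothetical, and counts are stored for every length from 1 to M rather than only for [µ..M].

```python
from math import comb

class Node:
    def __init__(self, label, max_len):
        self.label = label
        self.children = {}                 # label -> Node, in insertion order
        self.plus = [0] * (max_len + 1)    # plus[l]: occurrences of length l with the label
        self.minus = [0] * (max_len + 1)   # minus[l]: occurrences of length l without it

def tpspace(database, order, max_len):
    """Build a trie of occurrences; the root keeps the totals per length in root.minus."""
    position = {item: i for i, item in enumerate(order)}
    root = Node(None, max_len)
    for transaction in database:
        items = sorted(transaction, key=position.__getitem__)
        n = len(items)
        for l in range(1, max_len + 1):
            root.minus[l] += comb(n, l)            # all occurrences of length l of this transaction
        node = root
        for j, item in enumerate(items):
            child = node.children.get(item)        # SearchChild
            if child is None:
                child = Node(item, max_len)        # CreateNode
                node.children[item] = child        # AddChild
            remaining = n - j - 1                  # items of the transaction after this node
            for l in range(1, max_len + 1):
                child.plus[l] += comb(remaining, l - 1)   # assumed positive contribution
                child.minus[l] += comb(remaining, l)      # assumed negative contribution
            node = child
    return root

# Assumed toy database: t1 = AB, t2 = AC, t3 = BC, t4 = ABCD.
trie = tpspace(["AB", "AC", "BC", "ABCD"], order="ABCD", max_len=3)
node_a = trie.children["A"]
print(list(trie.children), trie.minus[2], node_a.plus[2], node_a.minus[2])
```

Each transaction is visited once and updates only the nodes on its own path, so the database is not needed after the construction.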

V. TPSampling: Trie-based Pattern Sampling
This section introduces trie of occurrences basics to understand our method. Then, it presents the algorithm TPSampling to draw a pattern proportionally to a given utility measure.

A. Drawing approach
To draw a pattern of length ℓ proportionally to a length-based utility u multiplied by its frequency in the database, we can uniformly draw an occurrence among the set of occurrences of length ℓ. To do this, we first draw an integer ℓ ∈ [µ..M] proportionally to $Φ_ℓ(∅, D) × f_u(ℓ)$. Second, we uniformly draw an occurrence of length ℓ from $L^o_{[ℓ..ℓ]}(D)$, but directly from the trie. More precisely, a numbering system assigns a number to each occurrence, and we then draw a random number to select the occurrence. The intuition of the numbering system that we use can be summarized as follows:
• It is a recursive and postfix traversal (in depth and from left to right). The occurrences represented at the root of a sub-trie are numbered from left to right and the children of a node are ordered.
• At the top level of a sub-trie, from the root identified by a prefix P, we give a lower rank to the occurrences without the label of P than to those containing it.
Definition 13 (Ranking occurrences by length): If $ϕ_j$ is an occurrence of length ℓ from a database D and T is a trie constructed from D, we denote by $rank_ℓ(ϕ_j, T)$ the rank of this occurrence in $L^o_{[ℓ..ℓ]}(D)$ relative to the trie T. This rank can be defined recursively as follows.
Table III gives the rank of each occurrence in the trie of Fig. 2: We also know that $CD_4 ∈ φ^-_2(A, D)$, which implies that $rank_2(CD_4, T) = rank_2(CD_4, T_A) = rank_2(CD_4, T_{AB}) = rank_2(CD_4, T_{ABC})$. Then we have

B. Trie-based pattern sampling algorithm
Algorithm 2 takes as input a trie of occurrences T, a length-based utility u ∈ U, and minimum µ and maximum M length constraints. It returns a pattern ϕ drawn proportionally to its utility in the corresponding database.

Find the i-th child $η_i ∈ T_P.child$ such that :
Check if the label of the current node is part of the pattern
10: ϕ ← ϕ ∪ $η_i$.label
11: x ← x − $η_i.Φ^-_ℓ(P, D)$

Uniform drawing of an occurrence of length ℓ.
To sample an occurrence of length ℓ, we uniformly draw a rank x in the interval $[1..T.Φ_ℓ(P, D)]$ (line 5). To find the occurrence corresponding to x, we scan the trie in depth-first search from left to right, looking for the nodes that satisfy the system of inequalities in line 7, which is based on Definition 13. This system of inequalities makes it possible to find the rank of the occurrence from the trie of root T. Whenever we encounter a node satisfying the system of inequalities, we test whether the item it contains is a candidate for the pattern to be returned (line 9), and we add it to the pattern if necessary (line 10). In line 13, we consider the sub-trie whose root is the node satisfying the system of inequalities. Thus, the new rank to visit is obtained by subtracting from the old value of x the sum of the weights of the first i − 1 children of the current node, the parent of $η_i$ (line 8), and the negative contribution of node $η_i$ (line 11). We then look for the remaining ℓ − 1 items of the pattern to be returned in the sub-trie of root $η_i$. The process is iterated until the current value of ℓ is equal to 0. The set of items selected at the different visited nodes forms the pattern returned at line 14.
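The descent described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the trie is rebuilt with binomial counts (an assumed form of Property 1), counts are kept for every length from 1 to M, children are scanned linearly, and occurrences without a node's label receive the lower ranks, as in the numbering system of Section V-A. All names are hypothetical.

```python
import random
from math import comb

class Node:
    def __init__(self, label, max_len):
        self.label, self.children = label, {}
        self.plus = [0] * (max_len + 1)    # occurrences per length with the label
        self.minus = [0] * (max_len + 1)   # occurrences per length without the label

def build_trie(database, max_len):
    """Trie of occurrences for transactions given as lexicographically sorted strings."""
    root = Node(None, max_len)
    for t in database:
        for l in range(1, max_len + 1):
            root.minus[l] += comb(len(t), l)       # root stores the totals per length
        node = root
        for j, item in enumerate(t):
            node = node.children.setdefault(item, Node(item, max_len))
            r = len(t) - j - 1
            for l in range(1, max_len + 1):
                node.plus[l] += comb(r, l - 1)
                node.minus[l] += comb(r, l)
    return root

def occurrence_at_rank(root, length, x):
    """Return the items of the occurrence of the given length whose rank is x (1-based)."""
    pattern, node = [], root
    while length > 0:
        for child in node.children.values():       # find the child whose range contains x
            weight = child.minus[length] + child.plus[length]
            if x <= weight:
                break
            x -= weight                            # skip the weights of the previous children
        if x > child.minus[length]:                # the label belongs to the occurrence
            pattern.append(child.label)
            x -= child.minus[length]
            length -= 1
        node = child
    return tuple(pattern)

def tpsampling(root, f_u, mu, max_len, rng=random):
    """Draw a length proportionally to total * f_u, then a uniform occurrence of that length."""
    lengths = list(range(mu, max_len + 1))
    weights = [root.minus[l] * f_u(l) for l in lengths]
    length = rng.choices(lengths, weights=weights)[0]
    return occurrence_at_rank(root, length, rng.randint(1, root.minus[length]))

# Assumed toy database: t1 = AB, t2 = AC, t3 = BC, t4 = ABCD.
root = build_trie(["AB", "AC", "BC", "ABCD"], max_len=3)
ranked = ["".join(occurrence_at_rank(root, 2, x)) for x in range(1, 10)]
print(ranked)
print(tpsampling(root, lambda l: l, mu=1, max_len=3, rng=random.Random(0)))
```

Enumerating the ranks 1 to 9 for length 2 visits every occurrence exactly once, so each pattern appears as many times as its frequency; this is the property that makes the uniform draw of a rank equivalent to a frequency-proportional draw of a pattern.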

VI. Theoretical analysis
This section examines our trie sampling strategy in terms of soundness and complexity (memory storage and temporal). Property 2 shows that our sampling method TPSampling does an exact draw of a pattern.
Property 2 (Soundness): Let T be a trie of occurrences built from a transactional database and u a length-based utility. Algorithm 2 draws a pattern ϕ proportionally to its frequency weighted by its length-based utility.
Proof 2: Omitted due to space limitation.

A. Space complexity
The size of a trie of occurrences also depends on the information stored in the nodes. In our case, the higher the maximum length constraint, the larger the arrays and the greater the memory size. More precisely, if z is the number of nodes in the trie of occurrences, and µ and M are the minimum and maximum length constraints respectively, then the size in memory of the trie is in O(z × 2 × (M − µ)). Fortunately, the maximum length constraint must generally be small to avoid the long tail problem. It is also important to note that, to keep memory consumption low in practice, we do not materialize the columns of the arrays that only contain zero values. This trick counterbalances the impact of an increase of the maximum length constraint. Furthermore, a tight upper bound on the number of nodes is detailed in [17].

B. Time complexity
The time complexity of our method can be divided into three phases: the preprocessing time to build the trie of occurrences, the reprocessing time after a utility change, and the drawing time of an occurrence.
a) Preprocessing time: It is the most expensive phase of TPSpace. A first pass over the database is necessary to retrieve the items of the database D and to compute their frequencies in O(||D||), where ||D|| is the sum of the lengths of the transactions of the database D. The retrieved items are then ordered according to the chosen relation $>_I$ in O(|I| × log(|I|)). Next, before adding a transaction to the trie, we order its items in $O(T_{max} × \log(T_{max}))$, where $T_{max}$ is the maximum length of the transactions in the database D. Finally, let z be the total number of nodes in the trie, and µ and M the minimum and maximum length constraints respectively; then the weighting of the nodes is done in O(z × 2 × (M − µ)). Thus, the total complexity for building the trie of occurrences of the transactional database D defined on the set of literals I is in $O(||D|| + |I| × \log(|I|) + |D| × T_{max} × \log(T_{max}) + z × (M − µ))$.
b) Reprocessing time in utility change: When the utility changes without updating the length constraints µ and M, the complexity of the reprocessing time is in O(M − µ), which is very small. This is because only the array of the root of the trie is traversed to compute the new weight of each length ℓ ∈ [µ..M].
c) Drawing time of an occurrence: Let us denote by d the degree (number of children) of a node of the trie and by $d_{max}$ the maximum degree of the trie, $d_{max} ≤ |I|$. Line 7 of TPSampling finds the i-th node in $O(\log(d_{max}))$. Thus, by going down through the trie of occurrences, TPSampling draws an occurrence in $O(T_{max} × \log(d_{max}))$. So, a sample of k patterns is obtained by TPSampling in $O(k × T_{max} × \log(d_{max}))$. This complexity is comparable to that of the two-step algorithm [9] (with length constraints), which draws a sample of k patterns in $O(k × T_{max} × \log(|D|))$.
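Points b) and c) can be illustrated with two small helpers. The first recomputes the weight of each length from the root array only, which is all that a utility change requires. The second shows one way to locate the child in logarithmic time, assuming that the cumulative weights of the children are available as a sorted array; this storage choice and all names are assumptions, and the numbers are those of the assumed toy database with transactions AB, AC, BC, and ABCD.

```python
from bisect import bisect_left
from itertools import accumulate

def length_weights(phi_root, f_u, mu, max_len):
    """Weight of each length in [mu..max_len]: one multiplication per length."""
    return [phi_root[l] * f_u(l) for l in range(mu, max_len + 1)]

def find_child(cumulative, x):
    """Index of the first child whose cumulative weight reaches x, and the rank inside it."""
    i = bisect_left(cumulative, x)                  # binary search over the children
    previous = cumulative[i - 1] if i > 0 else 0
    return i, x - previous

# Totals per length stored at the root for the assumed toy database.
phi_root = {1: 10, 2: 9, 3: 4}
w_freq = length_weights(phi_root, lambda l: 1, 1, 3)   # frequency
w_area = length_weights(phi_root, lambda l: l, 1, 3)   # area: no pass over the trie needed

# Weights of the children A and B of the root for length 2.
cumulative = list(accumulate([8, 1]))
print(w_freq, w_area, find_child(cumulative, 9))
```

Switching from frequency to area only changes the list of length weights; the per-node counts in the trie are left untouched.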

VII. Experiments
This experimental section assesses the efficiency of our approach on large transactional databases. The experiments were conducted on 2 UCI databases, Susy and USCensus, and 2 synthetic databases built with the IBM-Generator¹, T10I4D2000K and T10I6D3000K. Table IV describes these benchmarks by number of items, number of transactions, and maximum and average transaction length. It also shows the number of trie nodes for each database according to the total order relation. The minimum length constraint is fixed at µ = 1 throughout the experiments. The prototype of our method is implemented in Python version 3 and all the experiments are performed on a 2.71 GHz 2 Core CPU with 12 GB RAM. The source code is available at https://github.com/TPSampling/TPSampling. As a baseline, we also implement the Two-Step approach proposed by Boley et al. [9] under length constraints.

A. Storage cost of the trie of occurrences
The cost² of storing a trie of occurrences in memory depends on both the total order relation $>_I$ and the maximum length constraint. According to the gain obtained in the last column of Table VI, the number of nodes is substantially lower with the $>^{freq}_I$ relation than with the $>^{lexico}_I$ relation, and the fewer the nodes, the lower the storage cost. Due to an "Out of memory" issue, it is not feasible to run TPSampling with the lexicographical order on the last two databases. Fig. 3 shows the evolution of the memory size required by the tries of each database according to the maximum length constraint M ∈ [2..10] and the chosen order relation. These experimental results show that our approach is sensitive to the total order relation. For instance, TPSampling + $>^{freq}_I$ returns an "Out of memory" exception with T10I6D3000K when the maximum length constraint is greater than 7. We also found that the trie of occurrences created according to $>^{freq}_I$ uses less memory than the Two-Step database representation. The latter generates an "Out of memory" exception with Susy, while both Two-Step and TPSampling + $>^{lexico}_I$ return an "Out of memory" exception with T10I4D2000K and T10I6D3000K.
¹ https://github.com/zakimjz/IBMGenerator
² Computed with the python package asizeof http://code.activestate.com/recipes/546530-size-of-python-objects-revised/

B. Speed of the approach
This section analyses the preprocessing time, the reprocessing time, and the pattern drawing time of our approach.
Evaluation of the preprocessing time. Interestingly, the time to build a trie of occurrences is independent of any length-based utility measure. In our experiments, we consider the maximum length constraints M ∈ {2, 6}. Table V presents the preprocessing times needed to build the tries of occurrences and those of Two-Step according to the maximum length constraint. Each experiment is repeated 10 times to obtain the average preprocessing times and the standard deviations. Because it only requires one pass through the database, Two-Step is faster than TPSpace in preprocessing. However, on Susy, T10I4D2000K, and T10I6D3000K, it throws an "Out of memory" exception, whereas TPSampling + $>^{freq}_I$ takes on average 18 minutes with the maximum length constraint M = 6 on T10I6D3000K. This preprocessing is done only once, after which the resulting trie of occurrences can be used with any length-based utility.
Evaluation of the reprocessing time for utility change. The reprocessing time, when the utility changes, depends linearly on the difference between the minimum and the maximum length constraints only. Utility measures such as frequency, area, and exponential decay have no notable impact on the speed of the reprocessing phase. In contrast, Two-Step must redo its preprocessing from scratch when the utility changes. Experiments show that in the reprocessing phase our approach needs very little time, less than $10 × 10^{-6}$ seconds with M = 10, while Two-Step needs 0.45 seconds with USCensus. Moreover, the reprocessing time is the same for all databases with the same length constraints. These results show the importance of the trie when changing the length-based utility measure.
Evaluation of the drawing time per pattern. We test the speed of our approach on the 4 databases and measure the average drawing time of a pattern with a maximum length constraint in [1..10] and an exponential .
When M is greater than 7, TPSampling throws an "Out of memory" exception with the T10I6D3000K database. We see that the drawing times of TPSampling+Area and TPSampling+Freq increase almost identically. TPSampling is efficient since it produces thousands of patterns per second.

VIII. Conclusion
This paper proposed a generic trie-based output pattern sampling method using two efficient algorithms. TPSampling samples patterns according to any length-based utility measure, using a trie of occurrences built by TPSpace. After building a trie of occurrences with fixed length constraints, the user can draw patterns with frequency, area, and exponential decay with α ∈ ]0, 1]. The experiments also show that our approach is very flexible under utility change and works well with large transactional databases thanks to the prefix-based compression.
We hope to parallelize our method in the future by adapting the BSP-based framework proposed by Diop and Ba [18]. In such case, the trie might be spread over many machines to parallelize the computation in the preprocessing phase, as well as the drawing and reprocessing phases.