AN EFFICIENT-DISTRIBUTED MODEL FOR MINING SEQUENTIAL PATTERNS
ON A LARGE SEQUENCE DATASET
Tran Minh Thai
Department of Information Technology, HUFLIT
minhthai@huflit.edu.vn
ABSTRACT
Sequential pattern mining is an active research area because of its many different applications.
There have been many studies proposing efficient mining algorithms. With the current trend, the size
of sequence datasets is growing, and research has been applying distributed processing models to the
problem of mining sequential patterns on sequence databases. One of the algorithms that applies a
distributed model to efficient sequential pattern mining is the sequential pattern mining algorithm
based on the MapReduce model on the cloud (SPAMC). However, SPAMC is still limited in mining
datasets that have a large number of distinct items. This article proposes a distributed algorithm to deal
with this problem, called the distributed algorithm for sequential pattern mining on a large sequence
dataset using dynamic bit vector structures on the MapReduce distributed programming model
(DSPDBV). In addition, the algorithm uses different techniques to prune redundant candidates early
and to reduce memory usage. Experimental results show that DSPDBV is highly efficient and scalable
for large sequence datasets. Moreover, DSPDBV is more efficient than SPAMC in handling datasets
that have a large number of distinct items.
Keywords: Distributed sequential pattern mining, Big data, Data mining, Dynamic bit vector,
Sequential pattern, MapReduce.
1. Introduction
Sequential pattern mining is the process of finding sequential patterns in a sequence dataset. It is
an important stage in sequence data mining. This issue plays a significant role in data mining and has
many useful applications in a variety of fields, such as economics, biology, e-commerce, or software
[1]. For example, in the field of economics, the common behavior of an object or group of objects can
determine the relationship between them. In the field of biology or medicine, these patterns can be
used to detect abnormalities of protein or DNA structures. In addition, in the software field, these
patterns can help predict inconsistent or unusual codes. The patterns can also be used in the field of
phrase analysis in text documents.
The sequential pattern mining problem has been extensively discussed and researched since
Agrawal and Srikant first proposed the AprioriAll [2] algorithm in 1995. Since then, many authors
have proposed other techniques, such as Apriori-based algorithms including GSP [3], SPADE [4],
SPAM [5], and LAPIN-SPAM [6], and pattern-growth algorithms such as FreeSpan [7] and PrefixSpan
[8]. However, these algorithms only mine locally on a single computer with a small dataset, and they
still have limitations, such as difficulty in scaling and communication costs in the system. To improve
the performance and scalability of sequential pattern mining on large datasets, many parallel mining
techniques based on the distributed programming model MapReduce [9] have been proposed, such as
PTDS [10] and SPAMC [11]. Although these algorithms have somewhat overcome the performance
and scalability issues, their runtime still degrades as the number of items increases.
To address this problem, an algorithm called DSPDBV is proposed that mines sequential patterns
in a distributed manner using dynamic bit vectors on the MapReduce framework. The rest of
this article is organized as follows. Section 2 contains the relevant preliminary concepts. Section 3
gives a brief description of the related works. Section 4 shows the proposed algorithm. Section 5
presents the experimental results, and the conclusions and future studies are included in Section 6.
2. Problem Definition
This section defines some basic concepts of sequence dataset mining and the issue of sequential
pattern mining.
Let I = {i1, i2, i3, …, in} be a set of distinct items (or events), where ij represents an item and
1 ≤ j ≤ n. A set of unordered items is called an itemset, and each itemset is represented in brackets. The
lexicographical order, denoted by ≤, is defined as any total order on I. All items in each itemset are
ordered according to ≤. A sequence S = (e1, e2, e3, …, em) represents an ordered list of itemsets, where
ej represents an itemset (1 ≤ j ≤ m). The size of a sequence is the number m of itemsets in the
sequence. The length of a sequence is the number of items in the sequence; a sequence of length k is
called a k-sequence. A sequence dataset SDB is a list of sequences and is denoted as SDB = {S1, S2, S3, …,
S|SDB|}, where |SDB| represents the number of sequences in SDB, and Si (1 ≤ i ≤ |SDB|) represents the
i-th sequence in SDB.
The support of a sequence S, denoted sup(S/SDB), is the number of sequences in SDB that contain
S as a subsequence. The problem addressed in this article is to find the full set of sequential patterns
in a large sequence dataset SDB with a given minimum support threshold (minSup), i.e., all sequences
whose support is at least minSup.
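For illustration, the following minimal Python sketch (not part of the proposed algorithm) computes
the support of one candidate sequence in a toy dataset by brute force; sequences are lists of itemsets
and the dataset contents are purely illustrative.

    # Brute-force support counting over a toy sequence dataset. A candidate is
    # contained in a sequence S if its itemsets can be matched, in order, to
    # itemsets of S by set inclusion.
    def is_subsequence(candidate, sequence):
        pos = 0
        for itemset in candidate:
            while pos < len(sequence) and not itemset <= sequence[pos]:
                pos += 1
            if pos == len(sequence):
                return False
            pos += 1
        return True

    def support(candidate, sdb):
        return sum(1 for s in sdb if is_subsequence(candidate, s))

    # Toy SDB with |SDB| = 3 sequences; items are single characters.
    sdb = [
        [frozenset('a'), frozenset('bc'), frozenset('d')],
        [frozenset('a'), frozenset('c')],
        [frozenset('b'), frozenset('d')],
    ]
    pattern = [frozenset('a'), frozenset('c')]  # the 2-sequence <(a)(c)>
    print(support(pattern, sdb))                # 2 -> a pattern if minSup <= 2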
3. Related Work
Since Agrawal and Srikant proposed the AprioriAll [2] algorithm for the sequential mining
problem, sequential pattern mining has been studied frequently. Initially, these algorithms performed
serial processing. Then, as the size of datasets increased, serial sequential pattern mining became
inefficient. Thus, some studies have proposed algorithms for mining sequential patterns in parallel.
Based on the characteristics of serial sequential pattern mining, these algorithms can be classified into
several approaches: horizontal dataset format algorithms, such as AprioriAll [2] and GSP [3]; vertical
dataset format algorithms, such as SPADE [4], SPAM [5], and LAPIN-SPAM [6]; and pattern-growth
algorithms, such as FreeSpan [7] and PrefixSpan [8].
Algorithms that use a vertical data format have proven to be more effective than the others
because they only scan the dataset once or twice. Furthermore, during the mining process, they only
use the information in the vertical-format table to generate new candidates and count their support
with logical and bitwise operators. SPADE [4] is a typical algorithm designed with this approach; it
represents the data as an IDList structure. To reduce the memory usage of SPADE, SPAM,
bitSPACE [12], and LAPIN-SPAM [6] use a bit vector data structure to represent the data, where the
value 1 corresponds to the occurrence of an item in a transaction; otherwise, the value is set to 0. To
eliminate redundant bits (0), Vo et al. proposed a dynamic bit vector structure (DBV) in DBV-Miner
[13]. Then, Tran proposed the CloFS-DBV algorithm [14].
In addition, to eliminate redundant candidates early, Philippe Fournier-Viger et al. proposed
the CMAP structure and the CM-SPAM algorithm [15] in 2014. CMAP maps each item k in I to the
set of items that can be appended to k by each of the two types of pattern extension to form new
candidates. For a given minSup, CMAP therefore acts as a reference table of the frequent 2-sequences
that satisfy the minSup.
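As a minimal sketch (with illustrative contents, not taken from [15]), such a co-occurrence map can be
kept as two dictionaries, one per extension type, and consulted before counting the support of a new
candidate:

    # Illustrative CMAP-style structure: for every frequent item x, keep the set
    # of items y such that the i-extension (y in the same itemset as x) or the
    # s-extension (y in a later itemset) forms a frequent 2-sequence. A candidate
    # built by extending a pattern ending in x with item y can be pruned
    # immediately when y is absent from the corresponding set.
    cmap_i = {'a': {'b'}, 'b': {'c'}}          # frequent i-extensions
    cmap_s = {'a': {'b', 'c'}, 'b': {'d'}}     # frequent s-extensions

    def can_s_extend(last_item, new_item):
        return new_item in cmap_s.get(last_item, set())

    print(can_s_extend('a', 'c'))   # True  -> keep the candidate
    print(can_s_extend('b', 'c'))   # False -> prune without support counting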
Sequence datasets tend to grow quickly and become more complex. Consequently, the mining
search space becomes large, and the problem of mining sequential patterns consumes more resources
and time. Thus, a parallel model is an ideal solution for mining sequential patterns. Some parallel
algorithms have been proposed, such as pSPADE [16]. In addition, the PIB-PRISM [17] and
pDBV-SPM [18] algorithms have been proposed based on multi-core processors. Although these
methods have good performance and scalability, they still suffer from limitations, such as limited
resources, communication costs, and load balancing.
Hadoop MapReduce [19] is based on the Google MapReduce platform and allows substantial
amounts of data to be distributed and processed across computer clusters with high reliability and
availability. The algorithms that have been designed on this platform include PTDS [10] and SPAMC
[11]. PTDS divides the original dataset into small datasets and executes the mining based on
PrefixSpan. The limitation of PTDS is that it mines the full set of candidates in each subset of the
dataset, so it does not work efficiently on long sequences. Alternatively, SPAMC is based on SPAM.
Instead of mining the full lexicographic tree, it divides the patterns into sub-trees and uses loops of the
MapReduce model over each sub-tree to find patterns. However, the SPAMC algorithm is not effective
for datasets with a large number of distinct items, since it takes a long time to transfer data between
the Map and Reduce functions.
4. The Proposed Algorithm
4.1. Data representation
The data structure of the proposed algorithm uses a dynamic bit vector structure combined with
the location information of the patterns in each transaction, first proposed in the CloFS-DBV
algorithm [14] and known as the CloFS-DBVPattern. In the DSPDBV algorithm, we improve this data
structure to increase the algorithm's performance. Instead of representing the positions of the
sequential patterns in the transactions as a list of positions, we use a bit vector. To extend the patterns,
we prebuild the reference matrices used to locate a candidate pattern in the transactions, called
sMATRIX and iMATRIX.
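A minimal sketch of the underlying idea is given below. It only illustrates a dynamic bit vector that
drops leading zero bits and supports joining by a bitwise AND; it does not cover the full position-aware
DBVPattern structure or the sMATRIX/iMATRIX lookups, and the class and field names are
illustrative.

    # Illustrative dynamic bit vector (DBV): leading zero bits are dropped and
    # only a start index plus the remaining bits are kept. Support is the number
    # of set bits; joining two items is a bitwise AND of their aligned vectors.
    class DBV:
        def __init__(self, start, bits):
            self.start = start          # index of the first stored bit
            self.bits = bits            # list of 0/1 values from that index on

        def support(self):
            return sum(self.bits)

        def and_join(self, other):
            start = max(self.start, other.start)
            end = min(self.start + len(self.bits), other.start + len(other.bits))
            bits = [self.bits[i - self.start] & other.bits[i - other.start]
                    for i in range(start, end)]
            return DBV(start, bits)

    # Item 'a' occurs in transactions 2..4, item 'b' in transactions 3 and 5.
    dbv_a = DBV(2, [1, 1, 1])
    dbv_b = DBV(3, [1, 0, 1])
    print(dbv_a.and_join(dbv_b).support())   # 1 -> transactions containing both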
4.2. DSPDBV Algorithm
Given a sequence dataset SDB and a sequence S ∈ SDB, suppose SDB is divided into n
partitions. Thus, SDB = p1 + p2 + … + pn. Then, sup(S/SDB) = sup(S/p1) + sup(S/p2) + … + sup(S/pn).
There are three remarks: (1) Given a minimum support threshold (minSup), the sequence S is called a
pattern if sup(S/p1) + sup(S/p2) + … + sup(S/pn) ≥ minSup; (2) Assume S is any pattern of SDB; if S
exists in pi (1 ≤ i ≤ n), then S is also considered a pattern in pi; (3) Given two patterns SA and SB, where
|SA| < |SB|, the candidate generation time of SA is less than that of SB because SA has fewer candidates
than SB.
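As a minimal sketch of this additivity, with purely hypothetical partition supports:

    # Supports of a sequence S counted independently on each partition are
    # summed to obtain its global support; S is a pattern when the sum reaches
    # minSup. The numbers below are illustrative only.
    partition_supports = [2, 0, 1]          # sup(S/p1), sup(S/p2), sup(S/p3)
    min_sup = 3
    total = sum(partition_supports)         # sup(S/SDB) = 3
    print(total >= min_sup)                 # True -> S is a sequential pattern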
Distributed mining using MapReduce
Table 1: DSPDBV algorithm
Algorithm 1: DSPDBV
Input: an SDB, a minSup and the maximum depth of child prefix tree depth
Output: complete set of frequent sequential patterns
1. let isExists;
2. let partitions = split SDB into n partitions;
3. call DistributedDBVConversion(partitions, minSup);
4. let isExists = check if output exists for listDBVItems;
5. while (isExists)
6.     call DistributedPatMining(listDBVPatterns, minSup, depth);
7.     let isExists = check if output exists for listDBVPatterns;
The MapReduce programming model is used for large-scale data processing based on the theory of
parallel computing and a distributed data processing model on clusters of computers. The proposed
algorithm (Algorithm 1) is implemented on the MapReduce model for parallel and distributed
sequential pattern mining. In this model, a Map function performs data conversion and generates
candidates, while a Reduce function calculates the support of the candidates and stores the candidates
that satisfy the minSup value. The implementation is divided into two phases: (1) DBV Conversion
Phase: it is done by one Map and one Reduce function; the Map function performs Step 2, and the
Reduce function performs Step 3 of the algorithm; (2) Sequential Pattern Mining Phase: it is also done
by one Map and one Reduce function and performs Steps 4 and 5. In the mining process, this phase is
called back in a loop of MapReduce executions. MapReduce works on data defined as <key, value>
pairs, where the key represents an item or a sequential pattern, and the value represents a BlockInfo or
a support. Thus, during the execution of the MapReduce process, the algorithm must convert the data
into (key, value) pairs. The details of this process are described in the following sections.
Table 1 shows the pseudocode of the DSPDBV algorithm. First, it divides SDB into n partitions
(line 2). Then, it scans these partitions to convert the sequences into the DBVItem structure and to find
the 2-sequences (line 3). Lines 5-7 mine sequential patterns on the subtrees at the same level; the
candidates are generated by extending the nodes of these subtrees. This process is repeated until there
are no new candidates.
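A minimal sketch of this driver loop is given below. The two job functions are passed in as parameters
and stand in for MapReduce job submissions (they are assumptions, not the actual Hadoop job classes);
the sketch mirrors the control flow of Algorithm 1 only, not the Hadoop job setup.

    # Illustrative driver for Algorithm 1: one conversion job, then repeated
    # mining jobs while a round still produces new frequent candidates.
    def run_dspdbv(sdb, min_sup, depth, n_partitions,
                   conversion_job, mining_job):
        # line 2: split SDB into n partitions
        partitions = [sdb[i::n_partitions] for i in range(n_partitions)]
        # line 3: DBV conversion phase (one MapReduce job)
        frontier = conversion_job(partitions, min_sup)
        # lines 5-7: repeat the mining phase until no output remains
        while frontier:
            frontier = mining_job(frontier, min_sup, depth)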
Distributed DBV Conversion
The partitioned dataset is distributed and stored in HDFS. In the DBV Conversion Phase, the data
conversion is performed in parallel by executing a MapReduce job (Table 2), with one Map function
called per partition. Each Map function builds the DBVItem set for its partition and also finds the set
of 2-sequences used to build CMAP (line 1). Two types of key-value pairs are output to the Reduce
function, and key-value pairs with the same key are transferred to the same Reduce function. If the
data is a DBVItem, then the key-value pair is <item, blockInfo>, where the key is an item in the
DBVItem and the value is the blockInfo of this item (lines 2-3). The other pair is <pattern, support> if
the data is a 2-sequence, where the key is the 2-sequence and the value is the support of this sequence
(lines 4-5).
Table 2: DistributedDBVConversion algorithm
Algorithm 2: DistributedDBVConversion
Mapper Side
Input: a partition of sequence database p
1. scan the partition data only one time to
   a. DBVItems<item, blockInfo>: construct a DBVItem for each item
   b. sequences<pattern, support>: find 2-sequences and their support
2. for each (dbv in DBVItems)
3.     output <dbv.item, dbv.blockInfo>
4. for each (item in sequences)
5.     output <item.pattern, item.support>
Reducer Side
Input: mapper output pairs <key, values> and a minSup
Output: complete set of frequent items, 2-sequences, and set of DBVItems divided by partition (listDBVItem)
1. let support;
2. let blockInfos<BlockInfo>;
3. if (key is pattern) // pattern is a 2-sequence
4.     for each (value in values)
5.         let support = sum of sup(value);
6.     if (support >= minSup)
7.         output <key, support>;
8. else if (key is item)
9.     for each (value in values)
10.        let support = sum of sup(value);
11.        add value to blockInfos;
12.    if (support >= minSup)
13.        output <key, support>;
14.        for each (blockInfo in blockInfos)
15.            output <key, blockInfo>, divided by partition;
The Reduce function calculates the total support of the items or 2-sequences. When the key is a
2-sequence, the total support is calculated as the sum of the values, and 2-sequences that do not satisfy
the minSup are eliminated (lines 3-7). When the key is an item, the support is counted from the DBV
in the BlockInfo, and items that meet the minSup condition are retained in each corresponding
partition (lines 8-15).
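The following minimal local simulation mirrors this key/value flow for one partition; it is illustrative
only, blockInfo is reduced to a per-sequence marker rather than a real DBV, and only 2-sequences of
the form <(a)(b)> are generated.

    # Local simulation of the conversion phase (Algorithm 2) on one partition:
    # the mapper scans each sequence once and emits <item, blockInfo>-style and
    # <2-sequence, support> pairs; the reducer keeps keys reaching minSup.
    from collections import defaultdict

    def conversion_map(partition):
        for seq_id, seq in enumerate(partition):
            for item in sorted({i for itemset in seq for i in itemset}):
                yield ('item', item), seq_id     # stands in for blockInfo
            pairs = {(a, b)                      # <(a)(b)> 2-sequences only
                     for x in range(len(seq)) for y in range(x + 1, len(seq))
                     for a in seq[x] for b in seq[y]}
            for pair in pairs:
                yield ('pattern', pair), 1

    def conversion_reduce(key, values, min_sup):
        kind, name = key
        support = len(values) if kind == 'item' else sum(values)
        if support >= min_sup:
            yield name, support

    partition = [[frozenset('a'), frozenset('bc')],
                 [frozenset('ab'), frozenset('c')]]
    grouped = defaultdict(list)
    for key, value in conversion_map(partition):      # Map phase
        grouped[key].append(value)                     # shuffle: group by key
    for key, values in grouped.items():                # Reduce phase
        for name, support in conversion_reduce(key, values, min_sup=2):
            print(name, support)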
Distributed Sequential Pattern Mining
In the sequential pattern mining process, mining all candidate patterns on the whole DBVTree
prefix tree is ineffective for a large dataset because of the wide search space, which requires a lot of
resources. However, the branches of the tree are independent of each other, so the candidate generation
process can be performed independently.
Thus, the proposed algorithm is designed to process the tree level by level, which makes the
mining process more flexible and requires fewer resources. The candidates in the child prefix trees are
mined independently in parallel by executing MapReduce processes. Furthermore, although the
branches of the tree have different heights, mining the child prefix trees in parallel at a sufficient depth
allows the mining process to achieve better load balancing. In addition, the algorithm is designed to
optimize the data transfer between Map and Reduce; the details of the distributed mining
implementation are described in Algorithm 3 (Table 3).
Table 3: Distributed Sequence Pattern Mining algorithm
Algorithm 3: DistributedPatMining
Mapper Side
Input: a list of root nodes of child prefix trees roots, a minSup and the maximum depth of child prefix tree depth
1. scan the listDBVItem of the particular partition only one time to
   a. load the list of DBVItems into DBV<item, blockInfo>
   b. extract the list of frequent items into items<item>
2. scan the 2-sequence data only one time to construct CMAP
3. for each (root in roots)
4.     call DBVPatternExt(root, DBV, CMAP, items, items, 1, depth);
Reducer Side
Input: mapper output pairs <key, values>, a minSup and the maximum depth of child prefix tree depth
Output: complete set of frequent sequential patterns and set of DBVPatterns divided by partition (listDBVPattern)
5. let support;
6. let blockInfos<BlockInfo>;
7. for each (value in values)
8.     let support = sum of sup(value);
9.     if (key is a node at the lowest depth in the child prefix tree)
10.        add value to blockInfos
11. if (support >= minSup)
12.    output <key, support>;
13.    if (key is a node at the lowest depth in the child prefix tree)
14.        for each (blockInfo in blockInfos)
15.            output <key, blockInfo>, divided by partition;
Generate Candidate Patterns: Candidates are generated in parallel by the Map function. The
algorithm uses a prefix tree structure in which the nodes are arranged in lexicographical order, which
makes the candidate generation process highly effective. Each node in the tree is a candidate pattern,
expressed in the DBVPattern structure. The data is then divided into subsets so that each node of the
tree is mined independently in different Map functions to improve the execution time. The subtree
extension process is done in three main steps: (1) building the frequent DBV set, (2) constructing the
CMAP structure, and (3) expanding the subtree (Table 3).
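A minimal sketch of the extension step is given below. It only shows s-extension with CMAP pruning
and a depth limit, leaves i-extension out, and takes the support-counting routine as a parameter; in
DSPDBV the supports would instead come from the DBVPattern bit vectors.

    # Illustrative node extension: extend a pattern with every frequent item,
    # prune candidates that contradict the CMAP reference before counting
    # support, and stop recursion at the configured subtree depth.
    def extend_node(pattern, items, cmap_s, min_sup, depth, max_depth,
                    count_support):
        if depth > max_depth:
            return []
        results = []
        last = pattern[-1][-1]                       # last item of the pattern
        for item in items:
            if item not in cmap_s.get(last, set()):  # CMAP pruning (s-extension)
                continue
            candidate = pattern + [[item]]           # s-extension: new itemset
            sup = count_support(candidate)
            if sup >= min_sup:
                results.append((candidate, sup))
                results += extend_node(candidate, items, cmap_s, min_sup,
                                       depth + 1, max_depth, count_support)
        return results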
Find the sequential patterns: The Reduce function calculates the total support of the candidate
patterns. In addition, if a candidate pattern is a node at the last level of the tree, the BlockInfo
information of the candidate pattern is retained (lines 7-10). Candidates that meet the minSup
condition are retained (lines 11-12), and the DBVPattern information of these candidates is stored and
distributed corresponding to the input data (the listDBVPattern datasets) to perform the next sequential
mining phase (lines 13-15) (Table 4).
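A minimal sketch of this reducer logic, with illustrative field names, is:

    # Sketch of the mining-phase reducer (Algorithm 3, reducer side): values for
    # one candidate key are summed into a global support; blockInfo is kept only
    # for candidates at the lowest depth of the subtree so the next MapReduce
    # round can continue extending them.
    def mining_reduce(key, values, min_sup, is_lowest_depth):
        support = sum(v['support'] for v in values)
        block_infos = [v['blockInfo'] for v in values] if is_lowest_depth else []
        if support >= min_sup:
            yield ('pattern', key, support)          # frequent candidate
            for block_info in block_infos:           # handed to the next round
                yield ('dbv', key, block_info)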
Table 5 describes the OutputDBVPattern algorithm, which is called in the DBVPatternExt algorithm to
transfer the results from the Map function to the Reduce function. Instead of transmitting all