
99
AN EFFICIENT-DISTRIBUTED MODEL FOR MINING SEQUENTIAL PATTERNS
ON A LARGE SEQUENCE DATASET
Tran Minh Thai
Department of Information Technology, HUFLIT
minhthai@huflit.edu.vn
ABSTRACT
Sequential pattern mining is an active research area because of its many different applications.
There have been many studies suggesting efficient mining algorithms. With the current trend, the size
of the sequence dataset is growing, and research has been applying the distributed processing model
on the problem of sequential patterns on sequence databases. One of the algorithms that apply the
distributed modeling to the efficient sequential pattern mining algorithm is the sequential pattern
mining algorithm based on the MapReduce model on the cloud (SPAMC). However, SPAMC is still
limited in mining datasets that have a large number of distinct items. This article proposes a
distributed algorithm to deal with this problem, called the distributed algorithm for sequential pattern
mining on a large sequence dataset using dynamic vector bit structures on the MapReduce distributed
programming model (DSPDBV). In addition, the algorithm uses different techniques for early prune
redundant candidates and reduce the amount of memory usage. Experimental results show that
DSPDBV is highly efficient and scalable for large sequence datasets. Moreover, DSPDBV is more
efficient than SPAMC handling datasets have a large number of distinct items.
Keywords: Distributed sequential pattern mining, Big data, Data mining, Dynamic bit vector,
Sequential pattern, MapReduce.
1. Introduction
A sequential pattern mining is a process of finding sequential patterns in a sequence dataset. It is
an important stage in sequence data mining. This issue plays a significant role in data mining and has
many useful applications in a variety of fields, such as economics, biology, e-commerce, or software
[1]. For example, in the field of economics, the common behavior of an object or group of objects can
determine the relationship between them. In the field of biology or medicine, these patterns can be
used to detect abnormalities of protein or DNA structures. In addition, in the software field, these
patterns can help predict inconsistent or unusual codes. The patterns can also be used in the field of
phrase analysis in text documents.
The sequential pattern mining problem has been extensively discussed and researched since
Agrawal and Srikant first proposed the AprioriAll [2] algorithm in 1995. Since then, many authors
have proposed other techniques, such as A priori-based algorithms including GSP [3], SPADE [4],
SPAM [5], LAPIN-SPAM [6], FreeSpan [7] and PrefixSpan [8]. However, these algorithms only mine
locally on a single computer with a small dataset, and these algorithms still have limitations, such as
difficulty expanding and communication costs in the system. To improve the performance and
scalability of sequential pattern mining on large datasets, many parallel mining techniques based on
distributed programing model MapReduce [9] have been proposed such as PTDS [10] and SPAMC
[11]. Although these algorithms have somewhat overcome the performance and scalability issues,
there is still the limitation of the runtime as the number of items increases.
To address this problem, an algorithm is proposed that can mine the distributed sequential
pattern using the dynamic bit vector based on the MapReduce framework called DSPDBV. The rest of
this article is constructed as follows. Section 2 contains the relevant preliminary concepts. Section 3
gives a brief description of the related works. Section 4 shows the proposed algorithm. Section 5
presents the experimental results, and the conclusions and future studies are included in Section 6.
2. Problem Definition
This section defines some basic concepts of sequence dataset mining and the issue of sequential
pattern mining.
Let I = {i1, i2, i3, …, in} be a set of distinct items (or events), where {ij} represents an item and 1
≤ j ≤ n. A set of unordered items is called an itemset, and each itemset is represented in brackets. The
lexicographical order denoted by is defined as any total on I. All items in each itemset are
ordered based on . A sequence S = (e1, e2, e3, …, em) represents an ordered list of itemsets, where