AN EFFICIENT-DISTRIBUTED MODEL FOR MINING SEQUENTIAL PATTERNS
ON A LARGE SEQUENCE DATASET
Tran Minh Thai
Department of Information Technology, HUFLIT
minhthai@huflit.edu.vn
ABSTRACT
Sequential pattern mining is an active research area because of its many different applications.
There have been many studies proposing efficient mining algorithms. With the current trend, the size
of sequence datasets is growing, and research has been applying distributed processing models to the
problem of mining sequential patterns on sequence databases. One of the algorithms that applies a
distributed model to efficient sequential pattern mining is the sequential pattern mining algorithm
based on the MapReduce model on the cloud (SPAMC). However, SPAMC is still limited in mining
datasets that have a large number of distinct items. This article proposes a distributed algorithm to deal
with this problem, called the distributed algorithm for sequential pattern mining on a large sequence
dataset using dynamic bit vector structures on the MapReduce distributed programming model
(DSPDBV). In addition, the algorithm uses different techniques to prune redundant candidates early
and to reduce memory usage. Experimental results show that DSPDBV is highly efficient and scalable
for large sequence datasets. Moreover, DSPDBV is more efficient than SPAMC in handling datasets
that have a large number of distinct items.
Keywords: Distributed sequential pattern mining, Big data, Data mining, Dynamic bit vector,
Sequential pattern, MapReduce.
1. Introduction
Sequential pattern mining is the process of finding sequential patterns in a sequence dataset. It is
an important stage in sequence data mining. This issue plays a significant role in data mining and has
many useful applications in a variety of fields, such as economics, biology, e-commerce, or software
[1]. For example, in the field of economics, the common behavior of an object or group of objects can
determine the relationship between them. In the field of biology or medicine, these patterns can be
used to detect abnormalities of protein or DNA structures. In addition, in the software field, these
patterns can help predict inconsistent or unusual codes. The patterns can also be used in the field of
phrase analysis in text documents.
The sequential pattern mining problem has been extensively discussed and researched since
Agrawal and Srikant first proposed the AprioriAll [2] algorithm in 1995. Since then, many authors
have proposed other techniques, such as Apriori-based algorithms including GSP [3], SPADE [4],
SPAM [5], and LAPIN-SPAM [6], and pattern-growth algorithms such as FreeSpan [7] and PrefixSpan
[8]. However, these algorithms only mine locally on a single computer with a small dataset, and they
still have limitations, such as difficulty in scaling and communication costs in the system. To improve
the performance and scalability of sequential pattern mining on large datasets, many parallel mining
techniques based on the distributed programming model MapReduce [9] have been proposed, such as
PTDS [10] and SPAMC [11]. Although these algorithms have somewhat overcome the performance
and scalability issues, their runtime still degrades as the number of items increases.
To address this problem, an algorithm called DSPDBV is proposed that mines sequential patterns
in a distributed manner using dynamic bit vectors on the MapReduce framework. The rest of
this article is organized as follows. Section 2 contains the relevant preliminary concepts. Section 3
gives a brief description of the related works. Section 4 shows the proposed algorithm. Section 5
presents the experimental results, and the conclusions and future studies are included in Section 6.
2. Problem Definition
This section defines some basic concepts of sequence dataset mining and the issue of sequential
pattern mining.
Let I = {i1, i2, i3, …, in} be a set of distinct items (or events), where ij represents an item and
1 ≤ j ≤ n. A set of unordered items is called an itemset, and each itemset is represented in brackets. The
lexicographical order, denoted by ≤, is defined as any total order on I. All items in each itemset are
ordered according to ≤. A sequence S = (e1, e2, e3, …, em) represents an ordered list of itemsets, where
ej represents an itemset (1 ≤ j ≤ m). The size of a sequence is the number m of itemsets in the
sequence. The length of a sequence is the number of items in the sequence; a sequence of length k is
called a k-sequence. A sequence dataset SDB is a list of sequences and is denoted as SDB = {S1, S2, S3, …,
S|SDB|}, where |SDB| represents the number of sequences in SDB, and Si (1 ≤ i ≤ |SDB|) represents the
i-th sequence in SDB.
The support of a sequence S, denoted sup(S/SDB), is the number of sequences in SDB that contain
S as a subsequence. The problem addressed in this article is to find the full set of sequential patterns
in a large sequence dataset SDB with a given minimum support threshold (minSup), i.e., all sequences
whose support is at least minSup.
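For illustration, the following minimal Python sketch (not part of the proposed algorithm) computes
the support of one candidate sequence in a toy dataset by brute force; sequences are lists of itemsets
and the dataset contents are purely illustrative.

    # Brute-force support counting over a toy sequence dataset. A candidate is
    # contained in a sequence S if its itemsets can be matched, in order, to
    # itemsets of S by set inclusion.
    def is_subsequence(candidate, sequence):
        pos = 0
        for itemset in candidate:
            while pos < len(sequence) and not itemset <= sequence[pos]:
                pos += 1
            if pos == len(sequence):
                return False
            pos += 1
        return True

    def support(candidate, sdb):
        return sum(1 for s in sdb if is_subsequence(candidate, s))

    # Toy SDB with |SDB| = 3 sequences; items are single characters.
    sdb = [
        [frozenset('a'), frozenset('bc'), frozenset('d')],
        [frozenset('a'), frozenset('c')],
        [frozenset('b'), frozenset('d')],
    ]
    pattern = [frozenset('a'), frozenset('c')]  # the 2-sequence <(a)(c)>
    print(support(pattern, sdb))                # 2 -> a pattern if minSup <= 2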
3. Related Work
Since Agrawal and Srikant proposed the AprioriAll [2] algorithm for the sequential mining
problem, sequential pattern mining has been studied frequently. Initially, these algorithms performed
serial processing. Then, as the size of datasets increased, serial sequential pattern mining became
inefficient. Thus, some studies have proposed algorithms for mining sequential patterns in parallel.
Based on the characteristics of serial sequential pattern mining, these algorithms can be classified into
several approaches: horizontal dataset format algorithms, such as AprioriAll [2] and GSP [3]; vertical
dataset format algorithms, such as SPADE [4], SPAM [5], and LAPIN-SPAM [6]; and pattern-growth
algorithms, such as FreeSpan [7] and PrefixSpan [8].
Algorithms that use a vertical data format have proven to be more effective than the others
because they only scan the dataset once or twice. Furthermore, during the mining process, they only
use the information in the vertical-format table to generate new candidates and count their support
with logical and bitwise operators. SPADE [4] is a typical algorithm designed with this approach; it
represents the data as an IDList structure. To reduce the memory usage of SPADE, SPAM,
bitSPACE [12], and LAPIN-SPAM [6] use a bit vector data structure to represent the data, where the
value 1 corresponds to the occurrence of an item in a transaction; otherwise, the value is set to 0. To
eliminate redundant bits (0), Vo et al. proposed a dynamic bit vector structure (DBV) in DBV-Miner
[13]. Then, Tran proposed the CloFS-DBV algorithm [14].
In addition, to eliminate redundant candidates early, Philippe Fournier-Viger et al. proposed
the CMAP structure and the CM-SPAM algorithm [15] in 2014. CMAP maps each item k in I to the
set of items that can be appended to k by each of the two types of pattern extension to form new
candidates. For a given minSup, CMAP therefore acts as a reference table of the frequent 2-sequences
that satisfy the minSup.
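As a minimal sketch (with illustrative contents, not taken from [15]), such a co-occurrence map can be
kept as two dictionaries, one per extension type, and consulted before counting the support of a new
candidate:

    # Illustrative CMAP-style structure: for every frequent item x, keep the set
    # of items y such that the i-extension (y in the same itemset as x) or the
    # s-extension (y in a later itemset) forms a frequent 2-sequence. A candidate
    # built by extending a pattern ending in x with item y can be pruned
    # immediately when y is absent from the corresponding set.
    cmap_i = {'a': {'b'}, 'b': {'c'}}          # frequent i-extensions
    cmap_s = {'a': {'b', 'c'}, 'b': {'d'}}     # frequent s-extensions

    def can_s_extend(last_item, new_item):
        return new_item in cmap_s.get(last_item, set())

    print(can_s_extend('a', 'c'))   # True  -> keep the candidate
    print(can_s_extend('b', 'c'))   # False -> prune without support counting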
Sequence datasets tend to grow quickly and become more complex. Consequently, the mining
search space becomes large, and the problem of mining sequential patterns consumes more resources
and time. Thus, a parallel model is an ideal solution for mining sequential patterns. Some parallel
algorithms have been proposed, such as pSPADE [16]. In addition, the PIB-PRISM [17] and
pDBV-SPM [18] algorithms have been proposed based on multi-core processors. Although these
methods have good performance and scalability, they still suffer from limitations, such as limited
resources, communication costs, and load balancing.
Hadoop MapReduce [19] is based on the Google MapReduce platform and allows substantial
amounts of data to be distributed and processed across computer clusters with high reliability and
availability. The algorithms that have been designed on this platform include PTDS [10] and SPAMC
[11]. PTDS divides the original dataset into small datasets and executes the mining based on
PrefixSpan. The limitation of PTDS is that it mines the full set of candidates in each subset of the
dataset, so it does not work efficiently on long sequences. Alternatively, SPAMC is based on SPAM.
Instead of mining the full lexicographic tree, it divides the patterns into sub-trees and uses loops of the
MapReduce model over each sub-tree to find patterns. However, the SPAMC algorithm is not effective
for datasets with a large number of distinct items, since it takes a long time to transfer data between
the Map and Reduce functions.
4. The Proposed Algorithm
4.1. Data representation
The data structure of the proposed algorithm uses a dynamic bit vector structure combined with
the location information of the patterns in each transaction, first proposed in the CloFS-DBV
algorithm [14] and known as the CloFS-DBVPattern. In the DSPDBV algorithm, we improve this data
structure to increase the algorithm's performance. Instead of representing the positions of the
sequential patterns in the transactions as a list of positions, we use a bit vector. To extend the patterns,
we prebuild the reference matrices used to locate a candidate pattern in the transactions, called
sMATRIX and iMATRIX.
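A minimal sketch of the underlying idea is given below. It only illustrates a dynamic bit vector that
drops leading zero bits and supports joining by a bitwise AND; it does not cover the full position-aware
DBVPattern structure or the sMATRIX/iMATRIX lookups, and the class and field names are
illustrative.

    # Illustrative dynamic bit vector (DBV): leading zero bits are dropped and
    # only a start index plus the remaining bits are kept. Support is the number
    # of set bits; joining two items is a bitwise AND of their aligned vectors.
    class DBV:
        def __init__(self, start, bits):
            self.start = start          # index of the first stored bit
            self.bits = bits            # list of 0/1 values from that index on

        def support(self):
            return sum(self.bits)

        def and_join(self, other):
            start = max(self.start, other.start)
            end = min(self.start + len(self.bits), other.start + len(other.bits))
            bits = [self.bits[i - self.start] & other.bits[i - other.start]
                    for i in range(start, end)]
            return DBV(start, bits)

    # Item 'a' occurs in transactions 2..4, item 'b' in transactions 3 and 5.
    dbv_a = DBV(2, [1, 1, 1])
    dbv_b = DBV(3, [1, 0, 1])
    print(dbv_a.and_join(dbv_b).support())   # 1 -> transactions containing both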
4.2. DSPDBV Algorithm
Given a sequence dataset SDB and a sequence S ∈ SDB, suppose SDB is divided into n
partitions. Thus, SDB = p1 + p2 + … + pn. Then, sup(S/SDB) = sup(S/p1) + sup(S/p2) + … + sup(S/pn).
There are three remarks: (1) Given a minimum support threshold (minSup), the sequence S is called a
pattern if sup(S/p1) + sup(S/p2) + … + sup(S/pn) ≥ minSup; (2) Assume S is any pattern of SDB; if S
exists in pi (1 ≤ i ≤ n), then S is also considered a pattern in pi; (3) Given two patterns SA and SB, where
|SA| < |SB|, the candidate generation time of SA is less than that of SB because SA has fewer candidates
than SB.
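As a minimal sketch of this additivity, with purely hypothetical partition supports:

    # Supports of a sequence S counted independently on each partition are
    # summed to obtain its global support; S is a pattern when the sum reaches
    # minSup. The numbers below are illustrative only.
    partition_supports = [2, 0, 1]          # sup(S/p1), sup(S/p2), sup(S/p3)
    min_sup = 3
    total = sum(partition_supports)         # sup(S/SDB) = 3
    print(total >= min_sup)                 # True -> S is a sequential pattern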
Distributed mining using MapReduce
Table 1: DSPDBV algorithm
Algorithm 1: DSPDBV
Input: an SDB, a minSup and the maximum depth of child prefix tree depth
Output: complete set of frequent sequential patterns
1. let isExists;
2. let partitions = split SDB into n partitions;
3. call DistributedDBVConversion(partitions, minSup);
4. let isExists = check if output exists for listDBVItems;
5. while (isExists)
6.     call DistributedPatMining(listDBVPatterns, minSup, depth);
7.     let isExists = check if output exists for listDBVPatterns;
The MapReduce programming model is used for large-scale data processing based on the theory of
parallel computing and a distributed data processing model on clusters of computers. The proposed
algorithm (Algorithm 1) is implemented on the MapReduce model for parallel and distributed
sequential pattern mining. In this model, a Map function performs data conversion and generates
candidates, while a Reduce function calculates the support of the candidates and stores the candidates
that satisfy the minSup value. The implementation is divided into two phases: (1) DBV Conversion
Phase: it is done by one Map and one Reduce function; the Map function performs Step 2, and the
Reduce function performs Step 3 of the algorithm; (2) Sequential Pattern Mining Phase: it is also done
by one Map and one Reduce function and performs Steps 4 and 5. In the mining process, this phase is
called back in a loop of MapReduce executions. MapReduce works on data defined as <key, value>
pairs, where the key represents an item or a sequential pattern, and the value represents a BlockInfo or
a support. Thus, during the execution of the MapReduce process, the algorithm must convert the data
into (key, value) pairs. The details of this process are described in the following sections.
Table 1 shows the pseudocode of the DSPDBV algorithm. First, it divides SDB into n partitions
(line 2). Then, it scans these partitions to convert the sequences into the DBVItem structure and to find
the 2-sequences (line 3). Lines 5-7 mine sequential patterns on the subtrees at the same level; the
candidates are generated by extending the nodes of these subtrees. This process is repeated until there
are no new candidates.
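A minimal sketch of this driver loop is given below. The two job functions are passed in as parameters
and stand in for MapReduce job submissions (they are assumptions, not the actual Hadoop job classes);
the sketch mirrors the control flow of Algorithm 1 only, not the Hadoop job setup.

    # Illustrative driver for Algorithm 1: one conversion job, then repeated
    # mining jobs while a round still produces new frequent candidates.
    def run_dspdbv(sdb, min_sup, depth, n_partitions,
                   conversion_job, mining_job):
        # line 2: split SDB into n partitions
        partitions = [sdb[i::n_partitions] for i in range(n_partitions)]
        # line 3: DBV conversion phase (one MapReduce job)
        frontier = conversion_job(partitions, min_sup)
        # lines 5-7: repeat the mining phase until no output remains
        while frontier:
            frontier = mining_job(frontier, min_sup, depth)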
Distributed DBV Conversion
The partitioned dataset is distributed and stored in HDFS. In the DBV Conversion Phase, the data
conversion is performed in parallel by executing a MapReduce job (Table 2), with one Map function
called per partition. Each Map function builds the DBVItem set for its partition and also finds the set
of 2-sequences used to build CMAP (line 1). Two types of key-value pairs are output to the Reduce
function, and key-value pairs with the same key are transferred to the same Reduce function. If the
data is a DBVItem, then the key-value pair is <item, blockInfo>, where the key is an item in the
DBVItem and the value is the blockInfo of this item (lines 2-3). The other pair is <pattern, support> if
the data is a 2-sequence, where the key is the 2-sequence and the value is the support of this sequence
(lines 4-5).
Table 2: DistributedDBVConversion algorithm
Algorithm 2: DistributedDBVConversion
Mapper Side
Input: a partition of sequence database p
1. scan the partition data only one time to
   a. DBVItems<item, blockInfo>: construct a DBVItem for each item
   b. sequences<pattern, support>: find 2-sequences and their support
2. for each (dbv in DBVItems)
3.     output <dbv.item, dbv.blockInfo>
4. for each (item in sequences)
5.     output <item.pattern, item.support>
Reducer Side
Input: mapper output pairs <key, values> and a minSup
Output: complete set of frequent items, 2-sequences, and set of DBVItems divided by partition (listDBVItem)
1. let support;
2. let blockInfos<BlockInfo>;
3. if (key is pattern) // pattern is a 2-sequence
4.     for each (value in values)
5.         let support = sum of sup(value);
6.     if (support >= minSup)
7.         output <key, support>;
8. else if (key is item)
9.     for each (value in values)
10.        let support = sum of sup(value);
11.        add value to blockInfos;
12.    if (support >= minSup)
13.        output <key, support>;
14.        for each (blockInfo in blockInfos)
15.            output <key, blockInfo>, divided by partition;
The Reduce function calculates the total support of the items or 2-sequences. When the key is a
2-sequence, the total support is calculated as the sum of the values, and 2-sequences that do not satisfy
the minSup are eliminated (lines 3-7). When the key is an item, the support is counted from the DBV
in the BlockInfo, and items that meet the minSup condition are retained in each corresponding
partition (lines 8-15).
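The following minimal local simulation mirrors this key/value flow for one partition; it is illustrative
only, blockInfo is reduced to a per-sequence marker rather than a real DBV, and only 2-sequences of
the form <(a)(b)> are generated.

    # Local simulation of the conversion phase (Algorithm 2) on one partition:
    # the mapper scans each sequence once and emits <item, blockInfo>-style and
    # <2-sequence, support> pairs; the reducer keeps keys reaching minSup.
    from collections import defaultdict

    def conversion_map(partition):
        for seq_id, seq in enumerate(partition):
            for item in sorted({i for itemset in seq for i in itemset}):
                yield ('item', item), seq_id     # stands in for blockInfo
            pairs = {(a, b)                      # <(a)(b)> 2-sequences only
                     for x in range(len(seq)) for y in range(x + 1, len(seq))
                     for a in seq[x] for b in seq[y]}
            for pair in pairs:
                yield ('pattern', pair), 1

    def conversion_reduce(key, values, min_sup):
        kind, name = key
        support = len(values) if kind == 'item' else sum(values)
        if support >= min_sup:
            yield name, support

    partition = [[frozenset('a'), frozenset('bc')],
                 [frozenset('ab'), frozenset('c')]]
    grouped = defaultdict(list)
    for key, value in conversion_map(partition):      # Map phase
        grouped[key].append(value)                     # shuffle: group by key
    for key, values in grouped.items():                # Reduce phase
        for name, support in conversion_reduce(key, values, min_sup=2):
            print(name, support)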
Distributed Sequential Pattern Mining
In the sequential pattern mining process, mining all candidate patterns on the whole DBVTree
prefix tree is ineffective for a large dataset because of the wide search space, which requires a lot of
resources. However, the branches of the tree are independent of each other, so the candidate generation
process can be performed independently.
Thus, the proposed algorithm is designed to process the tree level by level, which makes the
mining process more flexible and requires fewer resources. The candidates in the child prefix trees are
mined independently in parallel by executing MapReduce processes. Furthermore, although the
branches of the tree have different heights, mining the child prefix trees in parallel at a sufficient depth
allows the mining process to achieve better load balancing. In addition, the algorithm is designed to
optimize the data transfer between Map and Reduce; the details of the distributed mining
implementation are described in Algorithm 3 (Table 3).
Table 3: Distributed Sequence Pattern Mining algorithm
Algorithm 3: DistributedPatMining
Mapper Side
Input: a list of root nodes of child prefix trees roots, a minSup and the maximum depth of child prefix tree depth
1. scan the listDBVItem of the particular partition only one time to
   a. load the list of DBVItems into DBV<item, blockInfo>
   b. extract the list of frequent items into items<item>
2. scan the 2-sequence data only one time to construct CMAP
3. for each (root in roots)
4.     call DBVPatternExt(root, DBV, CMAP, items, items, 1, depth);
Reducer Side
Input: mapper output pairs <key, values>, a minSup and the maximum depth of child prefix tree depth
Output: complete set of frequent sequential patterns and set of DBVPatterns divided by partition (listDBVPattern)
5. let support;
6. let blockInfos<BlockInfo>;
7. for each (value in values)
8.     let support = sum of sup(value);
9.     if (key is a node at the lowest depth in the child prefix tree)
10.        add value to blockInfos
11. if (support >= minSup)
12.    output <key, support>;
13.    if (key is a node at the lowest depth in the child prefix tree)
14.        for each (blockInfo in blockInfos)
15.            output <key, blockInfo>, divided by partition;
Generate Candidate Patterns: Candidates are generated in parallel by the Map function. The
algorithm uses a prefix tree structure in which the nodes are arranged in lexicographical order, which
makes the candidate generation process highly effective. Each node in the tree is a candidate pattern,
expressed in the DBVPattern structure. The data is then divided into subsets so that each node of the
tree is mined independently in different Map functions to improve the execution time. The subtree
extension process is done in three main steps: (1) building the frequent DBV set, (2) constructing the
CMAP structure, and (3) expanding the subtree (Table 3).
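A minimal sketch of the extension step is given below. It only shows s-extension with CMAP pruning
and a depth limit, leaves i-extension out, and takes the support-counting routine as a parameter; in
DSPDBV the supports would instead come from the DBVPattern bit vectors.

    # Illustrative node extension: extend a pattern with every frequent item,
    # prune candidates that contradict the CMAP reference before counting
    # support, and stop recursion at the configured subtree depth.
    def extend_node(pattern, items, cmap_s, min_sup, depth, max_depth,
                    count_support):
        if depth > max_depth:
            return []
        results = []
        last = pattern[-1][-1]                       # last item of the pattern
        for item in items:
            if item not in cmap_s.get(last, set()):  # CMAP pruning (s-extension)
                continue
            candidate = pattern + [[item]]           # s-extension: new itemset
            sup = count_support(candidate)
            if sup >= min_sup:
                results.append((candidate, sup))
                results += extend_node(candidate, items, cmap_s, min_sup,
                                       depth + 1, max_depth, count_support)
        return results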
Find the sequential patterns: The Reduce function calculates the total support of the candidate
patterns. In addition, if a candidate pattern is a node at the last level of the tree, the BlockInfo
information of the candidate pattern is retained (lines 7-10). Candidates that meet the minSup
condition are retained (lines 11-12), and the DBVPattern information of these candidates is stored and
distributed corresponding to the input data (the listDBVPattern datasets) to perform the next sequential
mining phase (lines 13-15) (Table 4).
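A minimal sketch of this reducer logic, with illustrative field names, is:

    # Sketch of the mining-phase reducer (Algorithm 3, reducer side): values for
    # one candidate key are summed into a global support; blockInfo is kept only
    # for candidates at the lowest depth of the subtree so the next MapReduce
    # round can continue extending them.
    def mining_reduce(key, values, min_sup, is_lowest_depth):
        support = sum(v['support'] for v in values)
        block_infos = [v['blockInfo'] for v in values] if is_lowest_depth else []
        if support >= min_sup:
            yield ('pattern', key, support)          # frequent candidate
            for block_info in block_infos:           # handed to the next round
                yield ('dbv', key, block_info)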
Table 5 describes the OutputDBVPattern algorithm, which is called in the DBVPatternExt algorithm to
transfer the results from the Map function to the Reduce function. Instead of transmitting all