Khai thác Weighted Sequential Patterns trong Sequence Database: Tóm tắt luận án Tiến sĩ ngành Computer

MINISTRY OF EDUCATION

AND TRAINING

VIETNAM ACADEMY OF

SIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SIENCE AND TECHNOLOGY

……..….***…………

TRAN HUY DUONG

MINING WEIGHTED SEQUENTIAL PATTERNS

IN SEQUENCE DATABASE

Major: Information System

Major code: 62 48 01 04

SUMMARY OF COMPUTER DOCTORAL THESIS

Ha Noi – 2021

The thesis has been completed at: Graduate University of Science

and Technology- Vietnam Academy of Science and Technology

Supervisor 1: Dr. Nguyen Truong Thang

Supervisor 2: Prof. Dr. Vu Duc Thi

Reviewer 1: …

Reviewer 2: …

Reviewer 3: ….

The thesis shall be defended in front of the Thesis Committee at Vietnam

Academy Of Science And Technology - Graduate University Of Science

And Technology, at ….… hour……, date…… month….…year 2021

This thesis could be found at:

- The National Library of Vietnam

- The Library of Graduate University of Science and Technology

INTRODUCTION

1. Overview

Mining frequent sequence patterns is one of the important issues and is studied by

many scholars in the field of data mining. Sequential pattern mining has many real-life

applications since data is encoded as sequences in many fields such as bioinformatics,

e-learning, market basket analysis, text analysis, and webpage click-stream analysis.

This field of research has emerged in the 1990s with the seminal paper of Agrawal and

Srikant [2].

Usually, the sequential pattern mining does not include additional information.

Meanwhile, the expansion of the information of the data series is very diverse such as

adding information about the weights of the items in the sequence, information about

the quantity of the items in the sequence, information about the time interval.

For sequence databases with the time interval, the time interval of the occurrence

of data series allows analysis of how long after the sequence patterns will appear. The

studies so far have focused on detecting time-spaced sequence patterns that occur

between components in time-spaced sequence databases, where the time interval is a

well-defined numerical value.

Conventional classical sequence pattern mining algorithms do not care about the

importance of each data item in the sequence (weights) nor the number of data items

in each data item in the sequence (quantitative). However, in practice, each data series

has different importance, including the value of internal (quantitative) and external

(weighted) benefits of the data series in the database.

Up to now, there have not been many studies on weighted sequence pattern mining

in time-spaced sequence databases, which are interested in both the weight of each item

in the data series and the time interval between the sequences.

At the same time, there have not been many studies on high utility sequence pattern

mining in time-spaced quantitative sequence databases, including the weight of each

item in the data series, the quantitative value of the items and the time interval between

the sequences in the quantitative sequence database has a time interval.

That is the reason for proposing the thesis "Mining Weighted Sequential Patterns

in Sequence Database". The thesis proposes and solves the problem of weighted

frequent sequence pattern mining in sequence databases with time interval and high

utility sequential pattern mining in quantitative sequence database with time interval.

2. Research aim and area

The objective of the thesis is to propose a solution to mine the weighted sequential

patterns between sequences in sequence databases with time interval and quantitative

sequence databases with time interval.

The thesis focuses on proposing solutions to:

• Mining weighted sequential patterns in sequence databases with time interval. The

sequential pattern found are then called the weighted sequential pattern with time

interval.

• Mining weighted sequential patterns in quantitative sequence databases with time

interval. The sequential pattern found are then called the high utility sequential

pattern with time interval.

I focus on researching and proposing new algorithms for mining frequent sequence

patterns, demonstrate correctness and completeness, analyze the computational

complexity of algorithms, test and analyze the significance of the frequently mined

sequential pattern.

3. Research methodology

The thesis studies the sequences the weights of the items in the sequence, the time

interval between the sequences, the sequence databases with time interval and the

quantitative sequence databases with time interval. Current studies, algorithms and

methods of pattern mining on sequence databases, quantitative sequence databases

have factors of time distance, weight, and high utility.

4. New contributions of the thesis

The main contributions of the thesis are proposing and solve the following issues:

• Propose an algorithm to mining top-k frequent sequential patterns taking in the

weights of items and time intervals in sequence databases with time interval. The

results of the work are posted at [CT1].

• Propose two algorithms for mining high utility sequential patterns that taking in the

weights of items, the quantitative value of each item and the time interval in

quantitative sequence databases with time interval. The results of the work are

posted at [CT2], [CT3], [CT4], [CT5].

5. The thesis layout

The thesis consists of an introduction, 03 content chapters and a conclusion:

• Introduction: Present an overview of the thesis; research objectives, objects and

scope, research methods, the main contributions and structure of the thesis.

• Chapter 1: Overview of the research situation and definitions of weighted sequence

pattern mining in the sequence databases, the sequence databases with time interval

and the quantitative sequence databases with time interval.

• Chapter 2: Weighted sequential pattern mining in sequence databases with time

interval. This chapter raises the problem and proposes an algorithm to mine the top-

k weighted sequential patterns in the sequence database with time interval. The

correctness and completeness of the algorithm, the experimentation of the algorithm

on real datasets and comparison with previous studies.

• Chapter 3: High utility sequential pattern mining in quantitative sequence databases

with time interval. This chapter raises the problem and proposes two high utility

sequential pattern mining algorithms in quantitative sequence databases with time

interval. The correctness and completeness of the algorithm, the experimentation of

the algorithm on real datasets and comparison with previous studies.

• Conclusion: Present some conclusions, the contributions of the thesis, development

direction and issues of interest of the PhD student.

OVERVIEW THE WEIGHTED SEQUENTIAL PATTERNS

MINING IN SEQUENCE DATABASE

This chapter presents an overview and basic definitions of the problems of mining

sequential patterns in sequence databases, weighted sequential patterns in sequence

databases, sequential patterns in quantitative sequence databases, and sequence

patterns in sequence databases with time interval. This chapter also points out the

unresolved gaps from which to define the research problem of the thesis.

1.1. Sequential pattern mining

Sequence database SDB: Let there be a set of items (symbols) I={i1, i2, …, in}.

Sequence S is defined 𝑆=〈(𝑠1),(𝑠2),…,(𝑠𝑚)〉, and sj is a non-empty subset of items

sorted alphabetically or one item of I (𝑠𝑗⊆𝐼) . A sequence database SDB is a set of

tuples <sid, Sk>, where sid is an identification of a sequence and Sk is a sequence.

Size of the sequence: sequence 𝑆=〈(𝑠1),(𝑠2),…,(𝑠𝑚)〉 is denoted by |S| is the

number of elements sj in the sequence S. For example, sequence

S=<(a)(abd)(f)(a)(d)> has size |S| =5.

Length of the sequence: sequence 𝑆=〈(𝑠1),(𝑠2),…,(𝑠𝑚)〉 where sj is a non-empty

subset of items sorted alphabetically or one item of I (𝑠𝑗⊆𝐼) is denoted by l(S) is

number of items 𝑖𝑗∈𝐼 in the sequence S. A sequence of length l is called a l-sequence.

For example, sequence S=<(a)(abd)(f)(a)(d)> has length l(S) =7.

Subsequence and supersequence: a sequence α=〈(𝑆1),(𝑆2),…,(𝑆𝑚)〉 is called a

subsequence of another sequence 𝛽=〈(𝑆′1),(𝑆′2),…,(𝑆′𝑚),…,(𝑆′𝑛)〉 if 𝑆𝑖⊆𝑆′𝑖 is

denoted by 𝛼≼𝛽. The sequence β is called a supersequence of α. For example,

sequence <(abd)(f)> and sequence <(ab)(d)> are subsequences of the supersequence

<(a)(abd)(f)(a)(d)>.

Support of the sequence: The (absolute) support of a sequence α in a sequence

database SDB is defined as the number of sequences that contain α, and is denoted by

sup(α). In other words, 𝑠𝑢𝑝(𝛼)=|{𝛼|𝛼≼𝑆𝑖∧𝑆𝑖∈𝑆𝐷𝐵}|

A sequence α is called a sequential pattern if and only if the support of α is not less

than a pre-defined minimum support threshold minsup. In other words, sup(α)minsup.

Most sequential pattern mining algorithms in sequence database are usually

implemented according to the following approaches:

• Breadth-first search algorithms. They first scan the database to find frequent 1-

sequences (sequential patterns containing a single item).Then, they generate 2-

sequences by performing s-extensions and i-extensions of 1-sequences, then 3-

sequences are generated using the 2-sequences, then 4-sequences are generated using

the 3-sequences,and so on until no sequences can be generated. The typical algorithms

developed with this approach include: AprioriAll[2], GSP[10]...

• Depth-first search algorithms. They start from the sequences containing single

items, and then recursively perform i-extensions and s-extensions with one of these

sequences to generate larger sequences. Then, when a pattern can no longer be

extended, the algorithm backtrack to generate other patterns using other sequences.

The typical algorithms developed with this approach include: Spade [11], Spam [30],

Lapin-Spam [18], CM-Spade, CM-Spam [17], FreeSpan [13], PrefixSpan [31]….

1.2. Weighted sequential pattern mining

Weighted sequence database SDB: Let there be a set of items (symbols) I={i1, i2,

…, in}. Sequence S is defined 𝑆=〈(𝑠1),(𝑠2),…,(𝑠𝑚)〉, and sj is a non-empty subset of

items sorted alphabetically or one item of I (𝑠𝑗⊆𝐼) . A weighted sequence database

Summary of Computer doctoral thesis: Mining weighted sequential patterns in sequence database

The objective of the thesis is to propose a solution to mine the weighted sequential patterns between sequences in sequence databases with time interval and quantitative sequence databases with time interval.

Chủ đề:

Tài liệu liên quan

Tài liêu mới

AI tóm tắt

Giới thiệu tài liệu

Đối tượng sử dụng

Từ khoá chính

Nội dung tóm tắt

Hỗ trợ

Phương thức thanh toán

Theo dõi chúng tôi