MINISTRY OF EDUCATION
AND TRAINING
VIETNAM ACADEMY OF
SIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SIENCE AND TECHNOLOGY
……..….***…………
TRAN HUY DUONG
MINING WEIGHTED SEQUENTIAL PATTERNS
IN SEQUENCE DATABASE
Major: Information System
Major code: 62 48 01 04
SUMMARY OF COMPUTER DOCTORAL THESIS
Ha Noi 2021
The thesis has been completed at: Graduate University of Science
and Technology- Vietnam Academy of Science and Technology
Supervisor 1: Dr. Nguyen Truong Thang
Supervisor 2: Prof. Dr. Vu Duc Thi
Reviewer 1: …
Reviewer 2: …
Reviewer 3: ….
The thesis shall be defended in front of the Thesis Committee at Vietnam
Academy Of Science And Technology - Graduate University Of Science
And Technology, at ….… hour……, date…… month….…year 2021
This thesis could be found at:
- The National Library of Vietnam
- The Library of Graduate University of Science and Technology
1
INTRODUCTION
1. Overview
Mining frequent sequence patterns is one of the important issues and is studied by
many scholars in the field of data mining. Sequential pattern mining has many real-life
applications since data is encoded as sequences in many fields such as bioinformatics,
e-learning, market basket analysis, text analysis, and webpage click-stream analysis.
This field of research has emerged in the 1990s with the seminal paper of Agrawal and
Srikant [2].
Usually, the sequential pattern mining does not include additional information.
Meanwhile, the expansion of the information of the data series is very diverse such as
adding information about the weights of the items in the sequence, information about
the quantity of the items in the sequence, information about the time interval.
For sequence databases with the time interval, the time interval of the occurrence
of data series allows analysis of how long after the sequence patterns will appear. The
studies so far have focused on detecting time-spaced sequence patterns that occur
between components in time-spaced sequence databases, where the time interval is a
well-defined numerical value.
Conventional classical sequence pattern mining algorithms do not care about the
importance of each data item in the sequence (weights) nor the number of data items
in each data item in the sequence (quantitative). However, in practice, each data series
has different importance, including the value of internal (quantitative) and external
(weighted) benefits of the data series in the database.
Up to now, there have not been many studies on weighted sequence pattern mining
in time-spaced sequence databases, which are interested in both the weight of each item
in the data series and the time interval between the sequences.
At the same time, there have not been many studies on high utility sequence pattern
mining in time-spaced quantitative sequence databases, including the weight of each
item in the data series, the quantitative value of the items and the time interval between
the sequences in the quantitative sequence database has a time interval.
That is the reason for proposing the thesis "Mining Weighted Sequential Patterns
in Sequence Database". The thesis proposes and solves the problem of weighted
frequent sequence pattern mining in sequence databases with time interval and high
utility sequential pattern mining in quantitative sequence database with time interval.
2. Research aim and area
The objective of the thesis is to propose a solution to mine the weighted sequential
patterns between sequences in sequence databases with time interval and quantitative
sequence databases with time interval.
The thesis focuses on proposing solutions to:
Mining weighted sequential patterns in sequence databases with time interval. The
sequential pattern found are then called the weighted sequential pattern with time
interval.
Mining weighted sequential patterns in quantitative sequence databases with time
interval. The sequential pattern found are then called the high utility sequential
pattern with time interval.
2
I focus on researching and proposing new algorithms for mining frequent sequence
patterns, demonstrate correctness and completeness, analyze the computational
complexity of algorithms, test and analyze the significance of the frequently mined
sequential pattern.
3. Research methodology
The thesis studies the sequences the weights of the items in the sequence, the time
interval between the sequences, the sequence databases with time interval and the
quantitative sequence databases with time interval. Current studies, algorithms and
methods of pattern mining on sequence databases, quantitative sequence databases
have factors of time distance, weight, and high utility.
4. New contributions of the thesis
The main contributions of the thesis are proposing and solve the following issues:
Propose an algorithm to mining top-k frequent sequential patterns taking in the
weights of items and time intervals in sequence databases with time interval. The
results of the work are posted at [CT1].
Propose two algorithms for mining high utility sequential patterns that taking in the
weights of items, the quantitative value of each item and the time interval in
quantitative sequence databases with time interval. The results of the work are
posted at [CT2], [CT3], [CT4], [CT5].
5. The thesis layout
The thesis consists of an introduction, 03 content chapters and a conclusion:
Introduction: Present an overview of the thesis; research objectives, objects and
scope, research methods, the main contributions and structure of the thesis.
Chapter 1: Overview of the research situation and definitions of weighted sequence
pattern mining in the sequence databases, the sequence databases with time interval
and the quantitative sequence databases with time interval.
Chapter 2: Weighted sequential pattern mining in sequence databases with time
interval. This chapter raises the problem and proposes an algorithm to mine the top-
k weighted sequential patterns in the sequence database with time interval. The
correctness and completeness of the algorithm, the experimentation of the algorithm
on real datasets and comparison with previous studies.
Chapter 3: High utility sequential pattern mining in quantitative sequence databases
with time interval. This chapter raises the problem and proposes two high utility
sequential pattern mining algorithms in quantitative sequence databases with time
interval. The correctness and completeness of the algorithm, the experimentation of
the algorithm on real datasets and comparison with previous studies.
Conclusion: Present some conclusions, the contributions of the thesis, development
direction and issues of interest of the PhD student.
OVERVIEW THE WEIGHTED SEQUENTIAL PATTERNS
MINING IN SEQUENCE DATABASE
This chapter presents an overview and basic definitions of the problems of mining
sequential patterns in sequence databases, weighted sequential patterns in sequence
databases, sequential patterns in quantitative sequence databases, and sequence
3
patterns in sequence databases with time interval. This chapter also points out the
unresolved gaps from which to define the research problem of the thesis.
1.1. Sequential pattern mining
Sequence database SDB: Let there be a set of items (symbols) I={i1, i2, …, in}.
Sequence S is defined 𝑆=(𝑠1),(𝑠2),,(𝑠𝑚), and sj is a non-empty subset of items
sorted alphabetically or one item of I (𝑠𝑗𝐼) . A sequence database SDB is a set of
tuples <sid, Sk>, where sid is an identification of a sequence and Sk is a sequence.
Size of the sequence: sequence 𝑆=(𝑠1),(𝑠2),,(𝑠𝑚) is denoted by |S| is the
number of elements sj in the sequence S. For example, sequence
S=<(a)(abd)(f)(a)(d)> has size |S| =5.
Length of the sequence: sequence 𝑆=(𝑠1),(𝑠2),,(𝑠𝑚) where sj is a non-empty
subset of items sorted alphabetically or one item of I (𝑠𝑗𝐼) is denoted by l(S) is
number of items 𝑖𝑗𝐼 in the sequence S. A sequence of length l is called a l-sequence.
For example, sequence S=<(a)(abd)(f)(a)(d)> has length l(S) =7.
Subsequence and supersequence: a sequence α=(𝑆1),(𝑆2),,(𝑆𝑚) is called a
subsequence of another sequence 𝛽=(𝑆′1),(𝑆′2),,(𝑆′𝑚),,(𝑆′𝑛) if 𝑆𝑖𝑆𝑖 is
denoted by 𝛼𝛽. The sequence β is called a supersequence of α. For example,
sequence <(abd)(f)> and sequence <(ab)(d)> are subsequences of the supersequence
<(a)(abd)(f)(a)(d)>.
Support of the sequence: The (absolute) support of a sequence α in a sequence
database SDB is defined as the number of sequences that contain α, and is denoted by
sup(α). In other words, 𝑠𝑢𝑝(𝛼)=|{𝛼|𝛼𝑆𝑖𝑆𝑖𝑆𝐷𝐵}|
A sequence α is called a sequential pattern if and only if the support of α is not less
than a pre-defined minimum support threshold minsup. In other words, sup(α)minsup.
Most sequential pattern mining algorithms in sequence database are usually
implemented according to the following approaches:
Breadth-first search algorithms. They first scan the database to find frequent 1-
sequences (sequential patterns containing a single item).Then, they generate 2-
sequences by performing s-extensions and i-extensions of 1-sequences, then 3-
sequences are generated using the 2-sequences, then 4-sequences are generated using
the 3-sequences,and so on until no sequences can be generated. The typical algorithms
developed with this approach include: AprioriAll[2], GSP[10]...
Depth-first search algorithms. They start from the sequences containing single
items, and then recursively perform i-extensions and s-extensions with one of these
sequences to generate larger sequences. Then, when a pattern can no longer be
extended, the algorithm backtrack to generate other patterns using other sequences.
The typical algorithms developed with this approach include: Spade [11], Spam [30],
Lapin-Spam [18], CM-Spade, CM-Spam [17], FreeSpan [13], PrefixSpan [31].
1.2. Weighted sequential pattern mining
Weighted sequence database SDB: Let there be a set of items (symbols) I={i1, i2,
…, in}. Sequence S is defined 𝑆=(𝑠1),(𝑠2),,(𝑠𝑚), and sj is a non-empty subset of
items sorted alphabetically or one item of I (𝑠𝑗𝐼) . A weighted sequence database