intTypePromotion=1
zunia.vn Tuyển sinh 2024 dành cho Gen-Z zunia.vn zunia.vn
ADSENSE

Summary of Computer doctoral thesis: Mining weighted sequential patterns in sequence database

Chia sẻ: _ _ | Ngày: | Loại File: PDF | Số trang:25

16
lượt xem
3
download
 
  Download Vui lòng tải xuống để xem tài liệu đầy đủ

The objective of the thesis is to propose a solution to mine the weighted sequential patterns between sequences in sequence databases with time interval and quantitative sequence databases with time interval.

Chủ đề:
Lưu

Nội dung Text: Summary of Computer doctoral thesis: Mining weighted sequential patterns in sequence database

  1. MINISTRY OF EDUCATION VIETNAM ACADEMY OF AND TRAINING SIENCE AND TECHNOLOGY GRADUATE UNIVERSITY OF SIENCE AND TECHNOLOGY ……..….***………… TRAN HUY DUONG MINING WEIGHTED SEQUENTIAL PATTERNS IN SEQUENCE DATABASE Major: Information System Major code: 62 48 01 04 SUMMARY OF COMPUTER DOCTORAL THESIS Ha Noi – 2021
  2. The thesis has been completed at: Graduate University of Science and Technology- Vietnam Academy of Science and Technology Supervisor 1: Dr. Nguyen Truong Thang Supervisor 2: Prof. Dr. Vu Duc Thi Reviewer 1: … Reviewer 2: … Reviewer 3: …. The thesis shall be defended in front of the Thesis Committee at Vietnam Academy Of Science And Technology - Graduate University Of Science And Technology, at ….… hour……, date…… month….…year 2021 This thesis could be found at: - The National Library of Vietnam - The Library of Graduate University of Science and Technology
  3. INTRODUCTION 1. Overview Mining frequent sequence patterns is one of the important issues and is studied by many scholars in the field of data mining. Sequential pattern mining has many real-life applications since data is encoded as sequences in many fields such as bioinformatics, e-learning, market basket analysis, text analysis, and webpage click-stream analysis. This field of research has emerged in the 1990s with the seminal paper of Agrawal and Srikant [2]. Usually, the sequential pattern mining does not include additional information. Meanwhile, the expansion of the information of the data series is very diverse such as adding information about the weights of the items in the sequence, information about the quantity of the items in the sequence, information about the time interval. For sequence databases with the time interval, the time interval of the occurrence of data series allows analysis of how long after the sequence patterns will appear. The studies so far have focused on detecting time-spaced sequence patterns that occur between components in time-spaced sequence databases, where the time interval is a well-defined numerical value. Conventional classical sequence pattern mining algorithms do not care about the importance of each data item in the sequence (weights) nor the number of data items in each data item in the sequence (quantitative). However, in practice, each data series has different importance, including the value of internal (quantitative) and external (weighted) benefits of the data series in the database. Up to now, there have not been many studies on weighted sequence pattern mining in time-spaced sequence databases, which are interested in both the weight of each item in the data series and the time interval between the sequences. At the same time, there have not been many studies on high utility sequence pattern mining in time-spaced quantitative sequence databases, including the weight of each item in the data series, the quantitative value of the items and the time interval between the sequences in the quantitative sequence database has a time interval. That is the reason for proposing the thesis "Mining Weighted Sequential Patterns in Sequence Database". The thesis proposes and solves the problem of weighted frequent sequence pattern mining in sequence databases with time interval and high utility sequential pattern mining in quantitative sequence database with time interval. 2. Research aim and area The objective of the thesis is to propose a solution to mine the weighted sequential patterns between sequences in sequence databases with time interval and quantitative sequence databases with time interval. The thesis focuses on proposing solutions to: • Mining weighted sequential patterns in sequence databases with time interval. The sequential pattern found are then called the weighted sequential pattern with time interval. • Mining weighted sequential patterns in quantitative sequence databases with time interval. The sequential pattern found are then called the high utility sequential pattern with time interval. 1
  4. I focus on researching and proposing new algorithms for mining frequent sequence patterns, demonstrate correctness and completeness, analyze the computational complexity of algorithms, test and analyze the significance of the frequently mined sequential pattern. 3. Research methodology The thesis studies the sequences the weights of the items in the sequence, the time interval between the sequences, the sequence databases with time interval and the quantitative sequence databases with time interval. Current studies, algorithms and methods of pattern mining on sequence databases, quantitative sequence databases have factors of time distance, weight, and high utility. 4. New contributions of the thesis The main contributions of the thesis are proposing and solve the following issues: • Propose an algorithm to mining top-k frequent sequential patterns taking in the weights of items and time intervals in sequence databases with time interval. The results of the work are posted at [CT1]. • Propose two algorithms for mining high utility sequential patterns that taking in the weights of items, the quantitative value of each item and the time interval in quantitative sequence databases with time interval. The results of the work are posted at [CT2], [CT3], [CT4], [CT5]. 5. The thesis layout The thesis consists of an introduction, 03 content chapters and a conclusion: • Introduction: Present an overview of the thesis; research objectives, objects and scope, research methods, the main contributions and structure of the thesis. • Chapter 1: Overview of the research situation and definitions of weighted sequence pattern mining in the sequence databases, the sequence databases with time interval and the quantitative sequence databases with time interval. • Chapter 2: Weighted sequential pattern mining in sequence databases with time interval. This chapter raises the problem and proposes an algorithm to mine the top- k weighted sequential patterns in the sequence database with time interval. The correctness and completeness of the algorithm, the experimentation of the algorithm on real datasets and comparison with previous studies. • Chapter 3: High utility sequential pattern mining in quantitative sequence databases with time interval. This chapter raises the problem and proposes two high utility sequential pattern mining algorithms in quantitative sequence databases with time interval. The correctness and completeness of the algorithm, the experimentation of the algorithm on real datasets and comparison with previous studies. • Conclusion: Present some conclusions, the contributions of the thesis, development direction and issues of interest of the PhD student. OVERVIEW THE WEIGHTED SEQUENTIAL PATTERNS MINING IN SEQUENCE DATABASE This chapter presents an overview and basic definitions of the problems of mining sequential patterns in sequence databases, weighted sequential patterns in sequence databases, sequential patterns in quantitative sequence databases, and sequence 2
  5. patterns in sequence databases with time interval. This chapter also points out the unresolved gaps from which to define the research problem of the thesis. 1.1. Sequential pattern mining Sequence database SDB: Let there be a set of items (symbols) I={i1, i2, …, in}. Sequence S is defined 𝑆 = 〈(𝑠1 ), (𝑠2 ), … , (𝑠𝑚 )〉, and sj is a non-empty subset of items sorted alphabetically or one item of I (𝑠𝑗 ⊆ 𝐼) . A sequence database SDB is a set of tuples , where sid is an identification of a sequence and Sk is a sequence. Size of the sequence: sequence 𝑆 = 〈(𝑠1 ), (𝑠2 ), … , (𝑠𝑚 )〉 is denoted by |S| is the number of elements sj in the sequence S. For example, sequence S= has size |S| =5. Length of the sequence: sequence 𝑆 = 〈(𝑠1 ), (𝑠2 ), … , (𝑠𝑚 )〉 where sj is a non-empty subset of items sorted alphabetically or one item of I (𝑠𝑗 ⊆ 𝐼) is denoted by l(S) is number of items 𝑖𝑗 ∈ 𝐼 in the sequence S. A sequence of length l is called a l-sequence. For example, sequence S= has length l(S) =7. Subsequence and supersequence: a sequence α = 〈(𝑆1 ), (𝑆2 ), … , (𝑆𝑚 )〉 is called a subsequence of another sequence 𝛽 = 〈(𝑆′1 ), (𝑆′2 ), … , (𝑆′𝑚 ), … , (𝑆′𝑛 )〉 if 𝑆𝑖 ⊆ 𝑆 ′ 𝑖 is denoted by 𝛼 ≼ 𝛽. The sequence β is called a supersequence of α. For example, sequence and sequence are subsequences of the supersequence . Support of the sequence: The (absolute) support of a sequence α in a sequence database SDB is defined as the number of sequences that contain α, and is denoted by sup(α). In other words, 𝑠𝑢𝑝(𝛼) = |{𝛼|𝛼 ≼ 𝑆𝑖 ∧ 𝑆𝑖 ∈ 𝑆𝐷𝐵}| A sequence α is called a sequential pattern if and only if the support of α is not less than a pre-defined minimum support threshold minsup. In other words, sup(α)minsup. Most sequential pattern mining algorithms in sequence database are usually implemented according to the following approaches: • Breadth-first search algorithms. They first scan the database to find frequent 1- sequences (sequential patterns containing a single item).Then, they generate 2- sequences by performing s-extensions and i-extensions of 1-sequences, then 3- sequences are generated using the 2-sequences, then 4-sequences are generated using the 3-sequences,and so on until no sequences can be generated. The typical algorithms developed with this approach include: AprioriAll[2], GSP[10]... • Depth-first search algorithms. They start from the sequences containing single items, and then recursively perform i-extensions and s-extensions with one of these sequences to generate larger sequences. Then, when a pattern can no longer be extended, the algorithm backtrack to generate other patterns using other sequences. The typical algorithms developed with this approach include: Spade [11], Spam [30], Lapin-Spam [18], CM-Spade, CM-Spam [17], FreeSpan [13], PrefixSpan [31]…. 1.2. Weighted sequential pattern mining Weighted sequence database SDB: Let there be a set of items (symbols) I={i1, i2, …, in}. Sequence S is defined 𝑆 = 〈(𝑠1 ), (𝑠2 ), … , (𝑠𝑚 )〉, and sj is a non-empty subset of items sorted alphabetically or one item of I (𝑠𝑗 ⊆ 𝐼) . A weighted sequence database 3
  6. SDB is a set of tuples , where sid is an identification of a sequence and Sk is a sequence, and each item 𝑖𝑗 ∈ 𝐼 is assigned a weight wj. Some weighted sequential pattern mining algorithms in weighed sequence databases such as MWSP [32], Wspan [33], WSPM [34]… 1.3. Quantitative sequential pattern mining Quantitative sequence database QSDB: Let there be a set of items (symbols) I={i1, i2, …, in}. Sequence S is defined 𝑆 = 〈(𝑠1 ), (𝑠2 ), … , (𝑠𝑚 )〉, and sj = (i1[k1], i2[k2], …, in[kn]) with 𝑖𝑗 ∈ 𝐼 and kj show the item i quantities in sj. A quantitative sequence database QSDB is a set of tuples , where sid is an identification of a sequence and Sk is a quantitative sequence. Some high utility sequential pattern mining algorithms in quantitative sequence databases such as UL,US [42], Uspan [43], HuspExt [45], HUSPM [47], PHUS [44]... 1.4. Sequential pattern mining with time interval Sequence database with time interval iSDB: Let there be a set of items (symbols) I={i1, i2, …, in}. Sequence S is defined 𝑆 = 〈(𝑡1,1 , 𝑠1 ), (𝑡1,2 , 𝑠2 ), … , (𝑡1,𝑚 , 𝑠𝑚 )〉, and sj is a non-empty subset of items sorted alphabetically or one item of I (𝑠𝑗 ⊆ 𝐼) and t, is the time interval between s và s. . A sequence database with time interval iSDB is a set of tuples , where sid is an identification of a sequence and Sk is a sequence with time interval. Some sequential pattern with time interval mining algorithms such as I-Apriori algorithm, I-PrefixSpan [38], FTI-Apriori và FTI-PrefixSpan [39] .. − If the items in I are defined as shown in section 1.2 above, then we have the weighted sequential patterns in the sequence database with time interval. Some weighted sequential pattern with time interval mining algorithms such as WIPrefixSpan [40], TopKWFP [CT1] .. − If the items in I are defined as shown in section 1.2 and 1.3 above, then we have the high utility sequential patterns in the quantitative sequence database with time intervals Some high utility sequential pattern with time interval mining algorithms such as FSPFTIM [48], UIPrefixSpan [CT3], HUISP [CT5]…. Conclusion Chapter 1 Chapter 1 presented an overview and summarized the problems related to sequential pattern mining in sequence databases, weighted sequential pattern mining in sequence databases, sequential pattern mining in quantitative sequence databases, sequential pattern mining in sequence databases with time interval. The following figure 1.1 describes an overview of the research issues of the thesis. 4
  7. Figure 1.1. The thesis rearch problems Currently, there are not many studies on weighted sequential pattern mining in sequence databases with time interval, which are interested in both the weight of each item in the sequence and the time interval between the sequences. Similarly, there have not been many studies on high utility sequential pattern mining in quantitative sequence databases with time interval that take into account the weight of each item in the sequence, the quantitative value of the each item and time intervals between sequences in a quantitative sequence database with time interval. Specifically, the thesis will focus on researching proposals and solutions to solve the following problems: - Problem 1: Propose one mining algorithm for mining top-k of weighted frequent sequences with time interval which is interested in both the weight of each item in the sequence and the time interval between sequences in the sequence database with time interval. This will be discussed in detail in Chapter 2. - Problem 2: Propose two high utility sequential pattern mining algorithms in quantitative sequence databases with time interval, which are also interested in the weight of each item in the sequence, the quantitative value of each item and the distance 5
  8. time between sequences in a quantitative sequence database with time intervals. This will be discussed in detail in Chapter 3. MINING WEIGHTED SEQUENTIAL PATTERNS IN SEQUENCE DATABASE WITH TIME INTERVAL In this chapter, the thesis will focus on presenting the top-k mining of weighted sequential patterns with time interval. The problem of detecting the top-k weighted sequential patterns in the sequence database with time interval is done starting with the study to solve the problem of mining the top-k frequent sequence patterns in the sequence database using the method longitudinal database representation TKS [21] by Fournier-Viger et al. and weighted sequential pattern with time interval mining problem WIPrefixSpan [40] was proposed by Duong et al. and then continued conducting the research and proposed the TopKWFP algorithm, which was published in the Journal of Computer Science and Cybernetics [CT1]. 2.1. The basic concepts Subsequence and supersequence with time interval A sequence α = 〈(𝑡1,1 , 𝑆1 ), (𝑡1,2 , 𝑆2 ), … , (𝑡1,𝑚 , 𝑆𝑚 )〉 is called a subsequence of another sequence 𝛽 = 〈(𝑡′1,1 , 𝑆′1 ), (𝑡 ′1,2 , 𝑆′2 ), … , (𝑡 ′1,𝑚 , 𝑆′𝑚 ), … , (𝑡 ′1,𝑛 , 𝑆′𝑛 )〉 if and only if : 𝑆𝑖 ⊆ 𝑆 ′ 𝑖 ⋀ (|𝑡1,1 − 𝑡 ′1,1 | = |𝑡1,2 − 𝑡 ′1,2 |= ⋯ = |𝑡1,𝑚 − 𝑡 ′1,𝑚 |) The subsequence 𝛼 is denoted by 𝛼 ≼ 𝛽. The sequence β is called a supersequence of α. For example, sequence and sequence are subsequences of the supersequence . Normalized weight of sequence with time interval Let there be a set of items (symbols) I={i1, i2, …, in}. Each item 𝑖𝑗 ∈ 𝐼 is assigned a weight wj with j = 1,.....,n. The normalized weight of the sequence  = 〈(𝑡1,1 , 𝑠1 ), (𝑡1,2 , 𝑠2 ), … , (𝑡1,𝑚 , 𝑠𝑚 )〉 with length k is denote by NW(): 1 𝑁𝑊() = ∑ 𝑤𝑗 𝑘 𝑖𝑗 ∈  Normalized weight support of sequence with time interval Normalized weight of sequence with time interval is denoted by 𝑁𝑊𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝛼): 1 𝑁𝑊𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝛼) = 𝑁𝑊(𝛼) ∗ 𝑠𝑢𝑝(𝛼) = ( ∑ 𝑤𝑗 ) ∗ 𝑠𝑢𝑝(𝛼) 𝑘 𝑖𝑗 ∈𝛼 Item interval constraints Let S= be an extracted interval extended sequence. The four item interval constraints are defined as follows: • C1 = min_time_interval is a minimum item interval between any two adjacent items, ti,i+1C1. 6
  9. • C2 = max_time_interval is a maximum item interval between any two adjacent items, ti,i+1≤C2. • C3 = min_whole_interval is a minimum item interval between the head and tail of the sequence, ti,m  C3. • C4 = max_whole_interval is a maximum item interval between the head and tail of the sequence, ti,m ≤ C4. The weighted sequential pattern with time interval Given a weighted sequence database with time interval iSDB, each item ij  I is assigned a weight wj. Given a minimum threshold wminsup and four time constraints C1, C2, C3, C4. A sequence  is a weighted sequential pattern with time interval if it satisfies: NWSupport()  wminsup ∧  satisfies C1, C2, C3, C4 Prefix and Postfix of a sequence with time interval Given a sequence  = , and s is a non-empty subset of items sorted alphabetically or one item of I (𝑠𝑗 ⊆ 𝐼). Therefore, there exists some j (1 ≤ j ≤ m) such that s ⊆ sj và t1,  = t1, j We define a Prefix in the sequence with time interval of  with s,t1, parameters as follows: Prefix (,s, t1, ) = The Postfix in the sequence with time interval of  with s,t1, parameters as follows: Postfix (,s, t1, ) = Where s’j is a subset of sj after subtracting the set s. If s’j =  then Postfix of  with s,t1, parameters as follows: Postfix (,s, t1, ) = Projected database Given a sequence  = of the sequence database with time interval iSDB. A -projected database is denoted by iSDB|, that is the collection of postfixes of sequences in iSDB with regard to prefix . Candidate sequential pattern with time interval Given a support threshold wminsup. MaxW is the maximum value of weights of the items in iSDB. An α sequence is called candidate weighted sequential pattern with time interval if it satisfies: Sup(α) * MaxW  wminsup ∧ α satisfies C1, C2, C3, C4 2.2. Algorithm to mine top-k weighted sequential patterns with time interval 2.2.1. The problem posed Given a weighted sequence database with time interval iSDB, each item ij  I is assigned a weight wj. Given a number k and four time constraints C1, C2, C3, C4. The problem: Find the set of top-k sequential patterns where each sequence t is called a sequence of top-k weighted sequential patterns with time interval if there are less than k sequences with normalized weighted support is greater than NWSupport(t) and t satisfies the time constraints C1, C2, C3, C4. The optimal wminsup value is defined 7
  10. as: ε=min{NWSupport(t)|t∈ Τ }. Where Τ is the set of top-k weighted sequential patterns with time interval. 2.2.2. The algorithm idea TopKWFP algorithm finds top-k item-interval weighted frequent sequential patterns which use Prefixspan’s pattern-growth method. Firstly, wminsup is set to zero, then sequential patterns are found by applying pattern-growth method. Whenever a pattern is found, it will be inserted into an ordered-by-weighted-support list L. This list is used to maintain the top-k pattern on-the-fly. Once there are k patterns in the list L, the internal wminsup variable is raised to the weighted support of the pattern with the lowest weighted support in L. With this raising minimum weighted threshold wminsup strategy, the TopKWFP algorithm’s search space is reduced. After k patterns are found in list L and wminsup value is raised, the newly found pattern will be inserted to L if it has weighted support value higher than wminsup and the patterns with weighted support lower than new wminsup will be eliminated from L. The internal wminsup value is thereafter raised to the weighted support of the new pattern with the lowest weighted support in L,... The TopKWFP algorithm continues until there is no pattern found, then the algorithm is finished and output the set of top-K item-interval weighted frequent sequential patterns. However, an algorithm simply incorporating raising minimum weighted threshold strategy does not have good performance. To improve the performance of TopKWFP, we have added a second strategy: Generating the most promising candidates. It is to try to generate the most promising candidate sequential patterns first. The rationale of this strategy is that if patterns with high support are found earlier, it allows TopKWFP to raise its internal wminsup variable faster, and thus to prune a larger part of the search space. To implement this strategy, TopKWFP uses an internal variable R to maintain at any time the set of patterns that can be extended to generate candidates. TopKWFP then always extends the pattern having the highest support first. It is noticed that all pattern in the R was ordered by support instead of NWSupport, because R contains only candidate patterns but not frequent sequence patterns. 2.2.3. Algorithm TopKWFP Input: A sequence database with time interval iSDB, W: Weight value of each item i: wi  W, k: the number top-k sequences, C1, C2, C3, C4: Item interval constrains. Output: The set of top-k weighted sequential patterns with time interval The TopKWFP algorithm is shown: TopKWFP Algorithm 1. Start 2. R= ∅; L= ∅; wminsup=0; 3. Scan iSDB first time 4. Begin 5. Count the support of each item i in iSDB, denoted as sup(i); 6. MaxW = max{W(i)}; 7. End; 8. For earch item i in iSDB 8
  11. 9.  = < (0, i) >; 10. If sup()*maxW  wminsup then 11. R=R; 12. End if; 13. If sup()*NW() wminsup then 14. SAVE(,L,k,wminsup) 15. End if; 16.End For; 17. If k < number of all item i in iSDB then 18. Scan iSDB second time 19. Begin 20. Rebuild iSDB by elilminate all items i in iSDB don’t satisfy condition sup()*maxW  wminsup; 21. End; 22. End if; 23. While ∃𝑟 ∈ 𝑅 and sup(r)*maxW  wminsup do 24. r = the highest sup(r)*NW(r) in R; 25. Build r-projected database is iSDB|r; 26. PROJECTION(iSDB|r,W(i),C1,C2,C3,C4,wminsup,k); 27. Remove r from R; 28. Remove from R all item s which sup(s)*maxW
  12. 1. Start 2. L= L {r}; 3. If |L| > k then 4. If NWSupport(r) > wminsup then 5. While |L| > k and ∃s ∈ L| NWSupport(s) = wminsup 6. Remove s from L. 7. End while; 8. End if; 9. wminsup = the lowest NWSupport of pattern in L; 10. End if; 11. End. 2.2.4. Experimental results and evaluation 2.2.4.1. Datasets The information about the datasets is described in Table 2.1: Table 2.1. Datasets’characteristics Sequence Distinct item Avg sequence Dataset Type of data count count length Bible 36369 13905 21.64 book BMS- 59601 497 2.42 web click stream WebView1 FIFA 20450 2990 34.74 web click stream Leviathan 5834 9025 33.81 book Sign sign language 730 267 51.99 utterances Experiments were performed on a computer with a Core i7 processor running Windows 10 and 8 GB RAM. The TopKWFP algorithm was implemented in Java 1.8 2.2.4.2. Experimental results a) Influence of parameter k The TopKWFP algorithm run on each dataset with k varied from 1000 to 10000 to evaluate the influence of k. The four constraints were set as C1=0; C2= 5; C3= 0; C4=15. The results are shown in Figure 2.1. It can be seen that the running time of the algorithm increase rapidly when the parameter k is increased. This is because when increasing the parameter k, the search space is expanded, leading to an increase in the number of candidate sequence patterns, requiring the algorithm to perform more computations. 10
  13. Bible BMS WebView1 Fifa Leviathan Sign Figure 2.1 Influence of parameter k b) Compare performance of TopKWFP algorithm with WIPrefixSpan algorithm Comparative test of WIPrefixSpan algorithm and TopKWFP algorithm. The TopKWFP algorithm is compared with the WIPrefixSpan algorithm with the optimal support threshold. To do this, the TopKWFP algorithm is first run to find the optimal support threshold, then this threshold is used to run the WIPrefixSpan algorithm. The results are shown in Figure 2.2. Bible BMS WebView1 Fifa Leviathan Sign Figure 2.2. Compare two algorithms WIPrefixSpan and TopKWFP The TopKWFP algorithm has better performance than WIPrefixSpan in all test datasets and in some cases TopKWFP runs several times faster than WIPrefixSpan. By using the most promising sequence patterns strategy, the TopKWFP algorithm generates only the most promising sequence patterns (sequence pattern with the highest support) instead of having to expand to the entire search space as in WIPrexSpan. Conclusion Chapter 2 In this chapter, the thesis presented the mining top-k weighted sequential patterns in the sequence database with time interval. In the TopKWFP algorithm, each data item 11
  14. is assigned its own weight value, and the data between the sequences in the database has a time gap. The algorithm is developed based on the WIPrefixSpan algorithm which uses the projected database and the sequence pattern growth method and the TKS algorithm which is the top-k sequential pattern mining algorithm in the sequence database. The algorithm is correct and complete. The complexity of the algorithm is exponential O(nn) (n is the maximum sequence length). Algorithms following the sequential pattern growth method often try to reduce the data space to find sequential patterns according to the set goal, the evaluation of the effectiveness is based on the experimental results. The thesis has also experimented with the algorithm on real data. Experimental results show the correctness and feasibility of the proposed algorithm. The paper on the TopKWFP algorithm was published in the Journal of Computer Science and Cybernetics [CT1]. MINING HIGH UTILITY SEQUENTIAL PATTERNS IN SEQUENCE DATABASE WITH TIME INTERVAL In this chapter, the thesis will focus on presenting the mining of high utility sequential patterns in the quantitative sequence database with time interval, including the weight of each item in the sequence, the quantitative value of each item and the time interval between sequences in the quantitative sequence database have time intervals. Performing high utility sequential pattern mining on the quantitative sequence database with time interval was developed based on the WIPrefixSpan algorithm [40] proposed by Duong et al and the high utility sequential pattern mining algorithm like US, UL [42] was proposed by Ahdmed et al. Research results and proposed UIPrefixSpan algorithm for mining high utility sequential patterns with time interval were published in the Proceedings of National Conference Selected Problems of Information Technology & Communication [CT2] and the Journal of Cybernetics and Information Technologies [CT3]. After that, PhD student continued to study the PHUS algorithm [44] proposed by Lan et al in high utility sequential pattern mining in quantitative sequence databases, proposing a 1-phase approach by calculating using the benefit table, the index table, and the research results to propose the HUISP algorithm to mine the high utility sequential pattern with the time interval were published in the Proceedings of National Conference Selected Problems of Information Technology & Communication [CT4 ] and the Journal of Computer Science and Cybernetics [CT5]. 3.1. The basic concepts Internal utility and external utility Internal utility of an item 𝑖𝑗 ∈ 𝐼 in a sequence Sj, denoted as iu(𝑖𝑗 ,Sj), is quantity of item 𝑖𝑗 in Sj. External utility of item 𝑖𝑗 is its significant value and denoted as eu(𝑖𝑗 ). The utility of an item The utility of an item 𝑖𝑗 in a sequence Sj denoted as su(𝑖𝑗 , Sj) is defined by : su(𝑖𝑗 , Sj) = iu(𝑖𝑗 , Sj) * eu(𝑖𝑗 ). 12
  15. The utility of a sequence Given a sequence 𝛼 = 〈(𝑡1,1 , 𝑋1 ), (𝑡1,2 , 𝑋2 ), … , (𝑡1,𝑛 , 𝑋𝑛 )〉 with length n and 𝛼 ≼ 𝑆𝑗 . The utility of the sequence α in Sj denoted as su(α,Sj) is defined by: 𝑠𝑢(α, 𝑆𝑗 ) = max {∑ 𝑠𝑢(i, 𝑆𝑗 ) , ∀α ≼ 𝑆𝑗 } 𝑖∈α The utility of an input sequence The utility of an input sequence Sj is the sum of utilities of all items in Sj, which means: 𝑠𝑢(𝑆𝑗 ) = ∑ 𝑠𝑢(i,𝑆𝑗 ) 𝑖∈𝑆𝑗 The utility of a sequence in the quantitative sequence database with time intervals The utility of the sequence α in a QiSDB denoted as su(α, QiSDB), is defined by: 𝑠𝑢(α, 𝑄𝑖𝑆𝐷𝐵) = ∑ 𝑠𝑢(α,𝑆𝑗 ) 𝑆𝑗 ∈𝑄𝑖𝑆𝐷𝐵 The utility of a quantitative sequence database with time intervals The utility of a QiSDB is defined by 𝑠𝑢(𝑄𝑖𝑆𝐷𝐵) = ∑ 𝑠𝑢(𝑆𝑗 ) 𝑆𝑗 ∈𝑄𝑖𝑆𝐷𝐵 Item interval constraints Let S= be an extracted interval extended sequence. The four item interval constraints are defined as follows: • C5 = min_time_interval is a minimum item interval between any two adjacent items, ti,i+1C5. • C6 = max_time_interval is a maximum item interval between any two adjacent items, ti,i+1≤C6. • C7 = min_whole_interval is a minimum item interval between the head and tail of the sequence, ti,m  C7. • C8 = max_whole_interval is a maximum item interval between the head and tail of the sequence, ti,m ≤ C8. The high utility sequential pattern with time interval Given a quantitative sequence database with time intervals QiSDB, each item 𝑖𝑗 ∈ 𝐼 in the input sequence Sj is assigned with an internal utility iu(𝑖𝑗 ,Sj) and an external utility eu(𝑖𝑗 ). Given a minimum utility threshold minUtil and four item interval constraints C5, C6, C7, C8 . A sequence α is a high utility sequential pattern with time interval if it satisfies: su(α,QiSDB)  minUtil ∧  satisfies C5, C6, C7, C8 Sequence weight utility swu Given a sequence α, the swu of α is defined as follows : 13
  16. 𝑠𝑤𝑢(𝛼, 𝑄𝑖𝑆𝐷𝐵) = ∑ 𝑠𝑢(𝑆𝑎 ) 𝛼 ≼𝑆𝑎 ∧ 𝑆𝑎 ∈𝑄𝑖𝑆𝐷𝐵 Candidate high utility sequential pattern with time interval Given a minimum threshold minUtil, a sequence α is a candidate high utility sequential pattern with time interval if it satisfies: swu(α, QiSDB)  minUtil ∧ α satisfies C5, C6, C7, C8. 3.2. Algorithm to mine high utility sequential patterns with time interval (UIPrefixSpan) 3.2.1. The problem posed Given a quantitative sequence database with time interval QiSDB, each item ij  I in the input sequence S is assigned with an internal utility and an external utility. Given a minimum utility threshold minUtil and four time constrains C5, C6, C7, C8. The problem posed: Find all high utility sequential patterns with time interval in QiSDB which means finding the set L such that : 𝐿 = { ≼ 𝑆| 𝑠𝑢(, QiSDB) ≥ 𝑚𝑖𝑛𝑈𝑡𝑖𝑙 ∧  satisfies 𝐶5 , 𝐶6 , 𝐶7 , 𝐶8 ∧ 𝑆 ∈ 𝑄𝑖𝑆𝐷𝐵} 3.2.2. The algorithm idea The UIPrefixSpan algorithm first initializes the values empty of the high utility sequence pattern candidate set R and the set of high utility sequential patterns L. Then scan the QiSDB database for the first time to find all items i that are candidates of length = 1 in the database. For each item i satisfying swu(i,QiSDB)  minUtil concatenate the value =, put  into the candidate set R. Perform recursion to the subUIPrefixSpan procedure to mine high utility sequential patterns on projected databases with prefix . Next, scan the QiSDB database for the second time to check the condition su(,QiSDB)  minUtil with all sequence  in the candidate set R found. If it is satisfied, then put into the result set L. 3.2.3. Algorithm UIPrefixSpan The UIPrefixSpan algorithm is shown: Algorithm 3.1. UIPrefixSpan Algorithm Algorithm UIPrefixSpan (QiSDB,W(i),minUtil,C5,C6,C7,C8) 1. Start 2. α = ; 3. R = ; L = ; 4. Scan QiSDB first time 5. Begin 6. For earch item i in QiSDB 7.  = < (0, i) >; 8. If swu(,QiSDB)  minUtil then 9. R=R; 10. End if; 11 R= subUIPrefixSpan (QiSDB|α,R,minUtil, C5, C6, C7, C8); 12. End For; 14
  17. 13. End; 14. Scan QiSDB second time 15. Begin 16. Calcutate su(,QiSDB) each sequence   R; 17. If su(α,QiSDB)  minUtil then 18. L=L; 19. End if; 20. End; 21. Rerurn L; 22. End. Procedure subUIPrefixSpan is shown: Procedure subUIPrefixSpan(QiSDB|r,R,minUtil,C5,C6,C7,C8) 1. Start 2. Scan QiSDB|r to find all pairs of item (∆t;i) that satisfy swu((∆t;i),QiSDB|r)minUtil, C5 và C6, with i is an item data and ∆t is item interval between r and i; 3. For each (∆t; i) 4. r = ; 5. If r satisfies C8 then 6. R= subUIPrefixSpan (QiSDB|r,R,minUtil, C5, C6, C7, C8); 7. If r satisfies C7 then 8. R=Rr; 9. End if; 10. End if; 11. End For; 12. Rerurn R; 13. End. 3.2.4. Experimental results and evaluation 3.2.4.1. Datasets The information about the datasets is described in Table 3.1: Table 3.1. Datasets’characteristics Distinct Avg Sequence Dataset item sequence Type of data count count length D10K.C9.T8.S7.I8.N1K (DS1) 10000 1000 8 IBM data generator D200K.C10.T9.S9.I7.N1K 200000 1000 7 IBM data generator (DS2) BMS-WebView1 (DS3) 59601 497 2.42 web click stream Experiments were performed on a computer with a Core i7 processor running Windows 10 and 8 GB RAM. The TopKWFP algorithm was implemented in Java 1.8 15
  18. 3.2.4.2. Experimental results a) Evaluate the relationship between the minimum threshold and implementation time Testing the performance of the UIPrefixSpan algorithm in two cases: Using time constraints C5=0; C6=5; C7=0; C8=20 (UIPrefixSpan1) and do not use the time constraint C5=0; C6=0; C7=0; C8=0 (UIPrefixSpan2). D10K.C9.T8.S7.I8.N1K D200K.C10.T9S9.I7.N1K BMS-WebView-1 Figure 3.1 Runtime UIPrefixSpan Figure 3.1: Runtime comparison shows that UIPrefixSpan1 is faster and more efficient than UIPrefixSpan2. When the minUtil value decreases, the running time of UIPrefxiSpan2 increases significantly. In contrast, UIPrefixSpan1 runs well with low minUtil and is much faster than UIPrefixSpan2 in both synthetic (DS1-DS2) and real (DS3) datasets. That's because when using the time limit (UIPrefixSpan1), fewer candidates are generated, so the search space is reduced and the running time is reduced. b) Evaluate the relationship between minimum threshold and memory usage D10K.C9.T8.S7.I8.N1K D200K.C10.T9S9.I7.N1K BMS-WebView-1 Figure 3.2 Memory UIPrefixSpan Figure 3.2: Memory efficiency comparison: UIPrefixSpan1 also uses less memory than UIPrefixSpan2. On DS1, UIPrefixSpan1 uses 1.2 times less memory than UIPrefixSpan2 and in some cases (with minUtil 2%, 9% and 10%) UIPrefixSpan1 algorithm uses up to 2.2 times less memory. On DS2, UIPrefixSpan1 uses 1.4 times less memory than UIPrefixSpan2, and with low minUtil (
  19. using the time limit, the search space is reduced and that causes our algorithm to use less memory. c) Evaluate the relationship between the minimum threshold and the high utility sequential patterns number D10K.C9.T8.S7.I8.N1K D200K.C10.T9.S9.I7.N1K BMS-WebView-1 Figure 3.3 The high utility sequential patterns number UIPrefixSpan Figure 3.3: Shows the number of high utility sequential patterns found in UIPrefixSpan1 less than UIPrefixSpan2, but those patterns are more significant by time constraints. By using time constraints, the creation of less meaningful patterns can be avoided. 3.3. Algorithm to mine high utility sequential patterns with time interval (HUISP) 3.3.1. The problem posed Given a quantitative sequence database with time interval QiSDB, each item ij  I in the input sequence S is assigned with an internal utility and an external utility. Given a minimum utility threshold minUtil and four time constrains C5, C6, C7, C8. The problem posed: Find all high utility sequential patterns with time interval in QiSDB which means finding the set L such that : 𝐿 = { ≼ 𝑆| 𝑠𝑢(, QiSDB) ≥ 𝑚𝑖𝑛𝑈𝑡𝑖𝑙 ∧  satisfies 𝐶5 , 𝐶6 , 𝐶7 , 𝐶8 ∧ 𝑆 ∈ 𝑄𝑖𝑆𝐷𝐵} 3.3.2. The algorithm idea The HUISP algorithm initializes the values empty of the high utility sequence patterns candidate set R, the set of the high utility sequential patterns L. Then scan the QiSDB database for the first time to calculate the utility values of each item i in the input sequence su(iSa), and the utility of each input sequence su(iSa). Then build a utility table for all items i in QiSDB. Scan the utility table, with each item i pairing the value  = < (0, i) > satisfying swu()  minUtil, put into the candidate set R. Remove items not in R from the database QiSDB. Further check the condition su(,QiSDB)  minUtil, if satisfied, put  in the result set L. Next, build an index table for each sequence in the candidate set R. For each sequence  in R, based on into the index table, construct a projected database for  including all the input ranges of QiSDB that contain . Recursively execute subHUISP procedure to mine high utility sequential patterns on projected databases with prefix . 3.3.3. Algorithm HUISP HUISP algorithm is shown: Algorithm 3.2. HUISP algorithm 17
  20. Algorithm HUISP (QiSDB,W(i),minUtil,C5,C6,C7,C8) 1. Start 2. α = ; 3. R = ; L = ; 4. Scan QiSDB 5. Begin 6. Calculate the utilities of all items in each input sequence su((0,i),iSa); 7. Calculate the utilities of each input sequence su(iSa); 8. End; 9. Build the utility table B for all sequences  = < (0, i) > in QiSDB; 10. Scan utility table B; 11. Begin 12. With each sequence  = < (0, i) >,  ∈ B 13. If swu ()  minUtil then 14. R=R; 15. End if; 16. If su(α,QiSDB)  minUtil then 17. L=L; 18. End if; 19. End; 20. Eliminate all items i which  = < (0, i) > not belong R from QiSDB; 21. Build the index table C for each item i in the sequence  = < (0, i) > in R from QiSDB; 22. Scan index table C; 23. Begin 24. With each item i, i ∈ C 25. Build α-projected database QiSDB|α with α= include all input sequence iSa of QiSDB which contains α; 26. Recalculate the utilities of each input sequence su(iSa) in QiSDB|α. 27. subHUISP(α,QiSDB|α ,R,minUtil, C5, C6, C7, C8); 28. End; 29. Rerurn L; 30. End. Procedure subHUISP is shown: Procedure subHUISP(r,QiSDB|r,R,minUtil,C5,C6,C7,C8) 1. Start 2. Scan QiSDB|r to find all pairs of item (∆t;i) that satisfy swu(,QiSDB|r)minUtil, C5 và C6, with i is an item data and ∆t is item interval between r and i; 18
ADSENSE

CÓ THỂ BẠN MUỐN DOWNLOAD

 

Đồng bộ tài khoản
2=>2