Summary of Computer and Information technology doctoral thesis: Research on developing method of mining fuzzy association rules based on linguistic information and its application
MINISTRY OF EDUCATION AND TRAINING
VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY

NGUYEN TUAN ANH

RESEARCH ON DEVELOPING METHOD OF MINING FUZZY ASSOCIATION RULES BASED ON LINGUISTIC INFORMATION AND ITS APPLICATION

Major: Mathematical Foundation for Informatics
Code: 62 46 01 10

SUMMARY OF COMPUTER AND INFORMATION TECHNOLOGY DOCTORAL THESIS

HA NOI, 2020
List of the author's works
1. Trần Thái Sơn, Nguyễn Tuấn Anh, "Nâng cao hiệu quả khai phá luật kết hợp mờ theo hướng tiếp cận đại số gia tử", Kỷ yếu hội nghị quốc gia lần VI về nghiên cứu cơ bản và ứng dụng công nghệ thông tin (FAIR), Huế, 6/2013.
2. Tran Thai Son, Nguyen Tuan Anh, "Improve efficiency fuzzy association rule using hedge algebra approach", Journal of Computer Science and Cybernetics, Vol. 30, No. 4, 2014.
3. Tran Thai Son, Nguyen Tuan Anh, "Hedge Algebras and fuzzy partition problem for qualitative attributes", Journal of Computer Science and Cybernetics, Vol. 32, No. 4, 2016.
4. Tran Thai Son, Nguyen Tuan Anh, "Partition fuzzy domain with multi-granularity representation of data based on Hedge Algebra approach", Journal of Computer Science and Cybernetics, Vol. 33, pp. 63-76, 2017.
INTRODUCTION
Mining association rules (ARs) is an important research direction that appeared early in data mining. Many algorithms have been developed, mainly in two directions: (i) improving the speed of rule mining algorithms, since this is a problem of exponential complexity due to repeated database (DB) scans; (ii) further research on the meaning of the mined rules, since not all mined rules are useful to users.
A fuzzy AR takes the form "If X is A, then Y is B". "X is A" is called the premise and "Y is B" the conclusion of the rule. X = {x1, x2, ..., xp} and Y = {y1, y2, ..., yq} are subsets of the attribute set I of the DB; A = {fx1, fx2, ..., fxp} and B = {fy1, fy2, ..., fyq} are the corresponding fuzzy sets of the attributes in X and Y.
Dividing the domain of an attribute is an important first step of an information processing process. Recently, researchers have paid attention to the construction of such membership functions (MFs) because of the obvious influence of this step on the subsequent steps.
The thesis studies methods of mining rule knowledge in fuzzy form combined with linguistic information (linguistic rules) from DBs or numerical data stores. Hedge algebras (HA) are used instead of fuzzy set theory to study some problems of AR mining: (i) fuzzy AR mining still has some disadvantages, both in constructing algorithms to increase processing speed and in planning fuzzy partitions of the attributes in order to find meaningful ARs; (ii) with different data representations, HA gives a simple unified approach that is highly effective in processing.
Research purposes:
- Study methods of expressing the semantics of fuzzy concepts based on MFs or other mathematical tools, provided that they express the semantics of the concepts most suitably.
- Study methods of knowledge mining in general and of fuzzy rules in particular.
- Research different data representations of information so that it can be exploited in ARs in a diverse and meaningful way. The thesis uses single-granular and multi-granular data representation, consistent with the increasing attention paid to this research direction.

CHAPTER 1. SOME BASIC KNOWLEDGE
1.1. Fuzzy set and operations on fuzzy sets
1.1.1. Fuzzy set
Definition 1.1: Let U be the universe of objects. A fuzzy set A on U is a set of ordered pairs (x, μ_A(x)), where μ_A is a function from U to [0, 1] assigning to each x of U a value μ_A(x) that reflects the degree of membership of x in the fuzzy set A.
1.1.2. Linguistic variable
1.1.3. Fuzzy partition
There are several methods of partitioning domain values using fuzzy logic, as follows.
Definition 1.3: Given m fixed points p_1, p_2, ..., p_m in U = [a, b] ⊂ R, the reference space of the base variable u of the linguistic variable X, a set T of m fuzzy sets A_1, A_2, ..., A_m defined on U (with corresponding MFs μ_A1, μ_A2, ..., μ_Am) is called a fuzzy partition of U if the following conditions are met, for all k = 1, ..., m:
1) μ_Ak(p_k) = 1 (p_k belongs to the part called the core of A_k);
2) if x ∉ [p_{k-1}, p_{k+1}], then μ_Ak(x) = 0;
3) μ_Ak(x) is continuous;
4) μ_Ak(x) is monotonically increasing on [p_{k-1}, p_k];
5) for every x ∈ U there exists k such that μ_Ak(x) > 0.
If the fuzzy partition also satisfies condition 6), it is called a strong fuzzy partition:
6) for all x ∈ U, Σ_{k=1..m} μ_Ak(x) = 1.
If a fuzzy partition satisfies conditions 7), 8), 9), it is called a uniform partition:
7) if k ≠ m, then h_k = p_{k+1} − p_k is a constant;
8) each MF μ_Ak(x) is a symmetric function;
9) the fuzzy sets μ_Ak(x) have the same geometry.
1.2. Hedge algebra
1.2.1. The concept of hedge algebra
Definition 1.4: A HA is a structure of four components denoted AX = (X, G, H, ≤), in which G is a set of generating elements, H is a set of hedges, and "≤" is a semantically induced order relation on X. Suppose that there are constant elements 0, W, 1 in G corresponding to the smallest element, the neutral element and the largest element of X. Each linguistic value x ∈ X is called a term of the HA. The set H consists of H− = {h−1 < h−2 < ... < h−q} and H+ = {h1 < h2 < ... < hp}.
1.2.2. Quantitative semantics of linguistic values
Definition 1.5: Suppose AX = (X, G, H, ≤) is a linear HA. The mapping v: X → [0, 1] is called a semantic quantitative function of AX if v is one-to-one from X into [0, 1] and preserves the order on X, i.e. for all x, y ∈ X, x < y implies v(x) < v(y), and:
(2) v(h_j x) = v(x) + Sign(h_j x)·{ Σ_{i=sign(j)}^{j} μ(h_i)·fm(x) − ω(h_j x)·μ(h_j)·fm(x) },
where −q ≤ j ≤ p, j ≠ 0, and ω(h_j x) = (1/2)·[1 + Sign(h_j x)·Sign(h_p h_j x)·(β − α)] ∈ {α, β}.
1.4. Mining association rules
1.4.1. Some concepts
Suppose I = {I1, I2, ..., Im} is a set of m distinct items (attributes). Assume D is a DB whose records contain subsets T of I; each record has its own index. An AR is a clause of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets X and Y are called itemsets.
Definition 1.10: An AR is a clause of the form X → Y, where X, Y ⊆ I are itemsets satisfying X ∩ Y = ∅. X is called the antecedent (cause) and Y the consequent of the rule.
There are two important measures for an AR: support and confidence.
Definition 1.11: The support of an itemset X is the ratio of the number of records in D containing X to the total number of records in D:
Supp(X) = |X| / |D|.   (1.1)
Definition 1.12: The support of the rule X → Y is the ratio of the number of records containing the set X ∪ Y to the total number of records in D:
Supp(X → Y) = P(X ∪ Y) = |X ∪ Y| / |D|.   (1.2)
Definition 1.13: The confidence of the rule X → Y is the ratio of the number of records in D containing X ∪ Y to the number of records in D containing X:
conf(X → Y) = Supp(X ∪ Y) / Supp(X).   (1.3)
1.4.2. The problem of fuzzy association rules
Let DT = {t1, t2, ..., tn} be the transaction database, with n the total number of records. Let I = {i1, i2, ..., im} be the items, where each item ij (1 ≤ j ≤ m) is a categorical attribute or a quantitative attribute. A fuzzy attribute set is a pair ⟨Z, C⟩, where Z is a set of attributes zj and C a corresponding set of fuzzy sets cj.
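The crisp support and confidence measures of Definitions 1.11 and 1.13 can be sketched as follows; the toy database and item names are illustrative, not from the thesis:

```python
# Minimal sketch of Definitions 1.11-1.13: crisp support and confidence.
# D is a list of transactions, each a set of items (names are invented).

def supp(itemset, D):
    """Supp(X) = (number of records containing X) / |D|   (1.1)."""
    return sum(1 for t in D if itemset <= t) / len(D)

def conf(X, Y, D):
    """conf(X -> Y) = Supp(X ∪ Y) / Supp(X)   (1.3)."""
    return supp(X | Y, D) / supp(X, D)

D = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]
print(supp({"bread", "milk"}, D))    # 2 of 4 records -> 0.5
print(conf({"bread"}, {"milk"}, D))  # 0.5 / 0.75 = 2/3
```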
A fuzzy association rule ⟨X is A⟩ ⇒ ⟨Y is B⟩ is called trustworthy if its support fs(⟨Z, C⟩) and confidence FC((X,A),(Y,B)) are large enough, where Z = X ∪ Y and C = A ∪ B. The support of an itemset ⟨Z, C⟩, denoted fs(⟨Z, C⟩), is determined by the following formula:
fs(⟨Z, C⟩) = ( Σ_{i=1}^{n} Π_{j=1}^{m} t_i[(z_j, c_j)] ) / n,   (1.4)
where m is the number of items in the itemset ⟨Z, C⟩. The fuzzy confidence is determined by:
FC((X,A),(Y,B)) = fs(⟨Z, C⟩) / fs(⟨X, A⟩).   (1.5)
The fuzzy AR mining algorithm is based on the Apriori algorithm.
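The fuzzy support (1.4) and fuzzy confidence (1.5) above can be sketched as follows; the record encoding and attribute names are illustrative assumptions:

```python
# Sketch of (1.4)-(1.5): fuzzy support is the average, over records, of the
# product of membership degrees; fuzzy confidence is a ratio of supports.
# Each record maps (attribute, fuzzy-set) pairs to memberships in [0, 1].
from math import prod

def fuzzy_supp(pairs, records):
    """fs(<Z, C>) = (1/n) * sum_i prod_j t_i[(z_j, c_j)]   (1.4)."""
    return sum(prod(t.get(p, 0.0) for p in pairs) for t in records) / len(records)

def fuzzy_conf(xa, yb, records):
    """FC((X,A),(Y,B)) = fs(<Z, C>) / fs(<X, A>)   (1.5)."""
    return fuzzy_supp(xa | yb, records) / fuzzy_supp(xa, records)

records = [
    {("age", "young"): 0.8, ("income", "low"): 0.6},
    {("age", "young"): 0.4, ("income", "low"): 0.9},
]
print(fuzzy_supp({("age", "young")}, records))                       # (0.8+0.4)/2
print(fuzzy_conf({("age", "young")}, {("income", "low")}, records))  # 0.42/0.6
```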
This algorithm is divided into two phases:
Phase 1: Find all frequent fuzzy attribute sets of the form ⟨Z, C⟩ whose support is at least the user-given minimum support: fs(⟨Z, C⟩) ≥ fminsup.
Phase 2: Generate reliable fuzzy ARs from the frequent itemsets found in the first phase. This phase is simple and takes less time than the previous phase. If ⟨Z, C⟩ is a frequent fuzzy attribute set, the association rules generated from it have the form ⟨Z′ is C′⟩ → ⟨Z\Z′ is C\C′⟩, where Z′ is a non-empty subset of Z, Z\Z′ is the set difference, C′ is the non-empty subset of C consisting of the fuzzy sets of the attributes in Z′, and C\C′ is the corresponding difference. The confidence fc of the rule must satisfy fc ≥ fminconf.
1.5. Conclusion
In this chapter, the thesis summarizes the basic background serving the research process, with the following main contents:
- Fuzzy set theory: fuzzy set concepts, methods for building fuzzy sets, linguistic variables, fuzzy partitions, etc.
- The theory of HA, with basic concepts such as HA, linear HA, complete linear HA, fuzziness measures of hedges, generating elements, methods of determining quantitative values of linguistic terms, fuzziness intervals, etc.
- Some basic concepts of ARs, fuzzy ARs and some research on mining fuzzy ARs.
The basic knowledge presented in this chapter is a sufficient foundation to implement the objectives set out in the thesis.
CHAPTER 2. MINING FUZZY AR BASED ON HEDGE ALGEBRA
In this chapter, the thesis proposes the application of HA and a solution to compress the fuzzy transaction DB to create a new transaction DB of smaller size. This method helps to find ARs in linguistic form that are close to people and reduces the time needed to mine ARs.
2.1. Introduction
Recently, algorithms that use data compression in binary DBs have provided a good solution that can reduce storage space and data processing time.
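Phase 2 above generates rules from each frequent itemset; a minimal sketch, where the precomputed support table and item labels are illustrative assumptions:

```python
# Hedged sketch of Phase 2: from a frequent itemset, emit every rule
# lhs -> itemset \ lhs whose confidence reaches fmin_conf.
from itertools import combinations

def gen_rules(itemset, fuzzy_supports, fmin_conf):
    items = sorted(itemset)
    rules = []
    for r in range(1, len(items)):                  # every non-empty proper lhs
        for lhs in combinations(items, r):
            lhs = frozenset(lhs)
            rhs = frozenset(items) - lhs
            fc = fuzzy_supports[frozenset(items)] / fuzzy_supports[lhs]
            if fc >= fmin_conf:                     # keep only reliable rules
                rules.append((set(lhs), set(rhs), fc))
    return rules

# Illustrative supports (not thesis data):
supports = {frozenset({"A_Low", "B_High"}): 0.42,
            frozenset({"A_Low"}): 0.60,
            frozenset({"B_High"}): 0.45}
for lhs, rhs, fc in gen_rules({"A_Low", "B_High"}, supports, 0.6):
    print(lhs, "->", rhs, round(fc, 2))
```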
Jia-Yu Dai (2008) proposed an algorithm to compress binary transaction DBs called M2TQT. The basic idea of this algorithm is to combine closely related transactions to form a new transaction; the result is a new DB of smaller size, which can reduce data processing time and storage space. M2TQT is rated better than previously proposed methods; however, it only works with binary DBs. In order to improve the efficiency of mining fuzzy ARs, the thesis proposes a method of mining fuzzy ARs based on the HA approach, using data compression for an arbitrary DB. With this approach, close transactions are combined to form new transactions, reducing the (horizontal) size of the input DB. Experiments show that this approach outperforms existing approaches. In this chapter, the author presents how to fuzzify attributes based on the HA approach, the fuzzy DB compression algorithm, and the fuzzy AR mining algorithm.
2.2. Mining fuzzy AR based on hedge algebra
2.2.1. Fuzzy transaction database
With the HA approach, the membership values of the DB values are calculated as shown below. First, the value domain of each fuzzy attribute is regarded as a HA. With the problem of
mining fuzzy ARs based on fuzzy set theory, we must construct MFs for each attribute and then, based on the built MFs, calculate the membership degrees of the values in the corresponding fuzzy domains. The thesis proposes instead that each quantitative attribute use one HA structure. Based on the semantic values of the elements of the HA, fuzzy partitions are built to calculate the membership of the elements of the DB in the fuzzy domains. Instead of building an MF for each fuzzy set, the quantitative semantic value is used to determine the membership degree of any value in the fuzzy sets defined above.
Step 1: Normalize the values of the fuzzy attribute into [0, 1].
Step 2: Consider the fuzzy domains s_j of attribute x_i to be elements of the HA AX_i. Any value d of x_i then lies between the quantitative semantic values of two consecutive elements of AX_i. The distance in [0, 1] between d and the quantitative semantic values of these two closest elements can be used to determine the proximity of d to the two fuzzy domains (two elements of the HA); the proximity between d and the other elements of the HA is set to 0. To obtain the final membership degree, the distance is normalized into [0, 1] and subtracted from 1. We thus have a pair of membership degrees for every value d. The membership of a value d of attribute x_i in the fuzzy domain s_j is:
μ_sj(d) = 1 − |v(s_j) − d|,
where v(s_j) is the quantitative semantic value of the element s_j.
Fig 2.1: Building a fuzzy partition based on HA
Example 2.1: Suppose a DB as in Table 2.1 with two attributes A and B. We have the fuzzy values shown in Table 2.2. Notation: A1, B1: Very Low; A2, B2: Least Low; A3, B3: Least High; A4, B4: Very High.
Table 2.1: Example DB
TID  A   B
T1   30  40
T2   41  48
T3   45  32
Table 2.2: Fuzzy data of Table 2.1
TID  A1     A2     A3     A4  B1     B2     B3     B4
T1   0.825  0.925  0      0   0      0.975  0.775  0
T2   0      0.965  0.785  0   0      0.895  0.855  0
T3   0      0.925  0.825  0   0.805  0.945  0      0
The HAs used for these two attributes have the same structure AX = (X, G, H, ≤), with C− = {Low}, C+ = {High}, H− = {Least}, H+ = {Very}, and parameters fm(Low) = fm(High) = 0.5, μ(Very) = μ(Least) = 0.5, Dom(A, B) = [0, 100]. Then the quantitative semantic values can be computed: v(Very Low) = 0.125, v(Least Low) = 0.375, v(Least High) = 0.625, v(Very High) = 0.875.
2.2.2. Transaction relationship and distance
Based on the distance between transactions, transactions that are close to each other can be combined to create a transaction group; as a result, a new DB of smaller size is obtained. The transaction relationship and the distance relationship for fuzzy DB transactions are defined as follows:
(1) Transaction relationship: Two transactions T1, T2 are said to be related to each other if T1 is a subset of T2 or T1 is a superset of T2.
(2) Transaction distance relationship: The distance between two transactions is the number of differing items. In Table 2.2, the distance between transactions T1 and T2 is D_T1-T2 = 2, and the distance between T1 and T3 is D_T1-T3 = 4.
Constructing the quantification table
To reduce the number of candidate sets created, more information is needed to eliminate infrequent sets. A quantification table is constructed to store this information as each transaction is processed. The items appearing in a transaction are arranged in dictionary order, starting with the items on the left, which are called the prefixes of the items. Then the length n of the input transaction is computed, and the numbers of items appearing in the transaction are recorded depending on the length of the transaction: L_n, L_{n-1}, ..., L_1. The quantification table consists of entries where each L_i contains an item prefix and its support value. Table 2.3 is the quantification table constructed from the DB in Table 2.2 (Table 2.3: Quantification table for the database of Table 2.2). With the quantification table, candidate sets with support less than the minimum support can easily be removed.
2.3. Transaction database compression
Let d be the relationship distance, initialized to 1. Based on the distance between transactions, we combine transactions with distance less than or equal to d to form a new transaction group and put it into the block of merged transaction groups.
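The fuzzification of Section 2.2.1 and the transaction distance of Section 2.2.2 can be sketched together, using the semantic values of Example 2.1; the label abbreviations are the author's own illustration, and the special handling of values beyond the outermost labels is omitted:

```python
# Sketch under Example 2.1 parameters: membership is 1 - |v(s_j) - x| for the
# two closest semantic values; transaction distance counts differing items.
V = {"VL": 0.125, "LL": 0.375, "LH": 0.625, "VH": 0.875}  # v(Very Low), ...

def fuzzify(x):
    # keep only the two closest labels; all other memberships are 0
    mems = {s: 1 - abs(v - x) for s, v in V.items()}
    top2 = sorted(mems, key=mems.get, reverse=True)[:2]
    return {s: round(mems[s], 3) for s in top2}

def distance(t1, t2):
    # number of differing items = size of the symmetric difference
    return len(t1 ^ t2)

T1 = {"A1", "A2", "B2", "B3"}   # nonzero fuzzy domains of T1 in Table 2.2
T2 = {"A2", "A3", "B2", "B3"}
print(fuzzify(0.30))    # A = 30 normalized: LL -> 0.925, VL -> 0.825
print(distance(T1, T2)) # D_T1-T2 = 2
```

The printed memberships reproduce the T1 row of Table 2.2 for attribute A.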
Figure 2.2 shows the overall process: the DB includes quantitative attributes; in the preprocessing part, the data are normalized into [0, 1] and the membership values of the attributes are calculated as shown in Section 2.2; then, from the obtained fuzzy DB, close transactions are merged together to create a new DB called the compressed DB. The details of the compression are presented in Algorithm 1. To find ARs from the compressed DB, the thesis proposes an improvement of the fuzzy Apriori algorithm, detailed in Algorithm 2.
Algorithm 1: Transaction compression algorithm
Input: Fuzzy transaction database D
Output: Compressed database
The notation of the parameters in the algorithm is as follows:
ML = {ML_k}: ML_k is the block of transaction groups of length k (the length of a transaction is the number of items in it);
L = {L_k}: L_k is the block of transactions of length k;
T_i: the ith transaction in the fuzzy DB;
|T_i|: the length of transaction T_i.
Algorithm content:
Step 1: Read one transaction T_i at a time from the fuzzy database.
Step 2: Compute the length n of the transaction T_i.
Step 3: Based on the input transaction, update the quantification table.
Step 4: Compute the distance between transaction T_i and the transaction groups in the blocks ML_{n-1}, ML_n, ML_{n+1}. If there exists a transaction group in these blocks whose distance to T_i is less than or equal to d, merge T_i into that transaction group; the old transaction group is removed.
Step 5: If T_i was not merged with any transaction group in the blocks ML_{n-1}, ML_n, ML_{n+1}, compute the distance between T_i and the transactions in the blocks L_{n-1}, L_n, L_{n+1}. If there exists a transaction T_j such that D_Ti-Tj ≤ d, merge T_i with T_j to form a new transaction group, add this group to the respective block (depending on the length of the group created), and remove T_j from the blocks L_{n-1}, L_n, L_{n+1}. If no transaction satisfies the distance d, add T_i to the block L_n.
Step 6: Repeat the five steps above until the last transaction is read.
Step 7: Read one transaction T_i at a time from L = {L_k}.
Step 8: Compute the length n of the transaction T_i.
Step 9: Compute the distance between T_i and the transaction groups in the blocks ML_{n-1}, ML_n, ML_{n+1}. If there exists a group with distance less than or equal to d, merge T_i into that group to create a new transaction group; based on the length of the new group, add it to the respective block, remove the old group from ML_{n-1}, ML_n, ML_{n+1}, and remove T_i from the block L_n.
Step 10: Repeat steps 7, 8 and 9 until the last transaction in L = {L_k} is read. Finally, the obtained compressed database consists of L = {L_k}, ML = {ML_k}, and the quantification table.
2.4. Algorithm for mining fuzzy AR
Algorithm 2: Fuzzy data mining with the HA approach
The notation of the parameters of the algorithm is as follows:
N: the total number of transactions in the database;
m: the total number of attributes;
A_j: the jth attribute, 1 ≤ j ≤ m (a quantitative or categorical attribute);
|A_j|: the number of HA labels of attribute A_j;
R_jk: the kth HA label of attribute A_j, 1 ≤ k ≤ |A_j|;
D(i): the ith transaction of the database, 1 ≤ i ≤ N;
v_j(k): the kth value of A_j in D(i);
f_jk(i): the membership degree of v_j(k) with the HA label R_jk, 0 ≤ f_jk(i) ≤ 1;
Sup: the support of each frequent itemset;
Conf: the confidence of each frequent itemset;
Min_sup: the given minimum support value;
Min_conf: the given minimum confidence value;
C_r: the set of candidate itemsets with r items, 1 ≤ r ≤ m;
L_r: the set of frequent itemsets with r items, 1 ≤ r ≤ m.
The mining algorithm based on HA for quantitative values is carried out as follows:
Input: transaction database D, HAs for the fuzzy attributes, Min_sup and Min_conf.
Output: fuzzy ARs.
Step 1: Convert the quantitative values v_j(k) of attribute A_j in D(i), for i = 1, ..., N. If v_j(k) lies beyond one of the two ends (the labels of the minimum and maximum hedges), there is only one hedge label for v_j(k), at that end. Otherwise, v_j(k) is represented by the two successive hedge labels forming the smallest segment containing v_j(k) on its value domain, each label corresponding to a membership degree f_jk(i) of v_j(k) with that hedge label. This membership is computed from the distance of v_j(k) to the value representing the corresponding hedge label.
Step 2: Carry out the transaction compression algorithm (Algorithm 1) on the fuzzy DB obtained in Step 1. As a result, we have the compressed transactions and the quantification table. Similarly to the Apriori algorithm, we apply the mining algorithm to the compressed DB to create the frequent itemsets.
Step 3: The values in TL1 of the quantification table are the supports of the labels R_jk. If Sup(R_jk) ≥ Min_sup, then R_jk is put into L1.
Step 4: If L1 ≠ ∅, go to the next step; if L1 = ∅, the algorithm ends.
Step 5: Build the frequent itemsets of level r from the frequent itemsets of level r − 1 by joining two frequent itemsets of level r − 1 that differ from each other in only one item. After joining, we have the candidate itemsets C_r. Before using the compressed DB to compute the support of the itemsets in C_r, some candidates can be eliminated without rescanning the compressed DB, based on the values of TLr in the quantification table.
Step 6: Scan the compressed DB and use formula (1.4) to compute the support of each itemset in C_r.
Any itemset whose support reaches the minimum support is put into L_r.
Step 7: Repeat for frequent itemsets of greater levels: for each candidate itemset S = (s1, s2, ..., st, ..., s_{r+1}) in C_{r+1}, 1 ≤ t ≤ r + 1:
(a) compute the support Sup(S) of S over the transactions;
(b) if Sup(S) ≥ Min_sup, then S is put into L_{r+1}.
Step 8: If L_{r+1} is empty, the next step is carried out; otherwise set r = r + 1 and repeat Steps 6 and 7.
Step 9: Generate the ARs from the collected frequent itemsets.
2.5. The experimental results
The experiments were performed with two algorithms, the proposed algorithm and the fuzzy Apriori algorithm, implemented in the C# programming language and run on a computer with an Intel(R) Core(TM) i5 CPU, 1.7 GHz, and 6 GB RAM. The thesis uses two DBs for testing: FAM95 and STULONG.
2.5.1. Experiment with the database FAM95
Table 2.4 gives the statistics of the number of ARs obtained by three methods: uncompressed DB, compressed DB, and compressed DB with quantification table. With support of 20% and 30%, the number of association rules of the proposed
thesis method is different from that of the method using the Apriori algorithm; with support from 40% to 70%, the number of ARs collected by the three methods is the same.
Table 2.4: Number of ARs obtained with confidence 80%
Minimum Support (%) | Uncompressed DB | Compressed DB | Compressed DB and quantification table
20 | 238 | 255 | 255
30 |  98 |  94 |  94
40 |  34 |  34 |  34
50 |  18 |  18 |  18
60 |   6 |   6 |   6
70 |   2 |   2 |   2
Table 2.5: ARs obtained with support 60% and confidence 80%
No. | Association rules | Support | Confidence
Uncompressed DB
1 | { VL_INCHEAD } ==> { VL_INCFAM } | 92% | 97%
2 | { VL_INCFAM } ==> { VL_INCHEAD } | 92% | 98%
3 | { LY_AGE } ==> { VL_INCHEAD } | 69% | 98%
4 | { LY_AGE } ==> { VL_INCFAM } | 70% | 99%
5 | { VL_INCHEAD, LY_AGE } ==> { VL_INCFAM } | 69% | 99%
6 | { VL_INCFAM, LY_AGE } ==> { VL_INCHEAD } | 69% | 99%
Compressed DB
1 | { VL_INCHEAD } ==> { VL_INCFAM } | 91% | 98%
2 | { VL_INCFAM } ==> { VL_INCHEAD } | 91% | 99%
3 | { LY_AGE } ==> { VL_INCHEAD } | 69% | 99%
4 | { LY_AGE } ==> { VL_INCFAM } | 69% | 100%
5 | { VL_INCHEAD, LY_AGE } ==> { VL_INCFAM } | 69% | 100%
6 | { VL_INCFAM, LY_AGE } ==> { VL_INCHEAD } | 69% | 99%
Compressed DB and quantification table
1 | { VL_INCHEAD } ==> { VL_INCFAM } | 91% | 98%
2 | { VL_INCFAM } ==> { VL_INCHEAD } | 91% | 99%
3 | { LY_AGE } ==> { VL_INCHEAD } | 69% | 99%
4 | { LY_AGE } ==> { VL_INCFAM } | 69% | 100%
5 | { LY_AGE, VL_INCHEAD } ==> { VL_INCFAM } | 69% | 100%
6 | { LY_AGE, VL_INCFAM } ==> { VL_INCHEAD } | 69% | 99%
Table 2.6: ARs obtained with support 70% and confidence 80%
No. | Association rules | Support | Confidence
Uncompressed DB
1 | { VL_INCHEAD } ==> { VL_INCFAM } | 92% | 97%
2 | { VL_INCFAM } ==> { VL_INCHEAD } | 92% | 98%
Compressed DB
1 | { VL_INCHEAD } ==> { VL_INCFAM } | 91% | 98%
2 | { VL_INCFAM } ==> { VL_INCHEAD } | 91% | 99%
Compressed DB and quantification table
1 | { VL_INCHEAD } ==> { VL_INCFAM } | 91% | 98%
2 | { VL_INCFAM } ==> { VL_INCHEAD } | 91% | 99%
It can be seen in Table 2.5 and Table 2.6 that the numbers of association rules of the three trials (uncompressed DB, compressed DB without quantification table, compressed DB with quantification table) have
the same number. In Table 2.5, comparing each rule of the three corresponding methods shows that the support and confidence of each rule differ, but not significantly.
Figure 2.3: Processing duration with the compressed database
Figure 2.3 presents the runtime comparison of the fuzzy Apriori algorithm on the uncompressed DB with the processing duration on the compressed DB without the quantification table. In Figure 2.4, the duration of the algorithm on the compressed DB with the quantification table is compared with that on the compressed DB without it. The time taken to compress the DB was 135 seconds; the number of transactions obtained after compression was 2402. The experimental results are reported with 60% confidence. The thesis tested two algorithms: association rule mining with the HA approach of [2], and the proposed algorithm compressing the fuzzy DB in the HA direction. The test results show that the proposed DB compression method gives faster results than the method proposed in [2]; moreover, the values of the frequent itemsets are similar to those obtained with the uncompressed DB.
2.5.2. Experiment with the database STULONG
Table 2.7 gives the statistics of the numbers of association rules obtained by three methods: uncompressed DB, compressed DB, and compressed DB with quantification table.
Table 2.7: Number of ARs obtained with confidence 80%
Minimum Support (%) | Uncompressed DB | Compressed DB | Compressed DB and quantification table
 5 | 7822 | 8188 | 8185
10 | 5076 | 5532 | 5527
20 | 2149 | 2528 | 2528
30 | 1096 | 1348 | 1318
40 |  587 |  599 |  599
50 |  248 |  287 |  287
60 |  107 |  155 |  155
70 |   75 |   75 |   75
80 |   23 |   35 |   35
Comment: The numbers of ARs of the proposed methodology using the compressed DB with and without the quantification table are nearly the same.
Table 2.8: Comparison of the processing durations of AR mining with confidence 80%
Minimum Support (%) | Uncompressed DB | Compressed DB | Compressed DB and quantification table
 5 | 669 | 41.4 | 41.4
10 | 580 | 26.4 | 26.3
20 | 187 |  8.3 |  8.3
30 |  72 |  3.6 |  3.5
40 |  26 |  1.1 |  1.1
50 |   8 |  0.4 |  0.4
60 |   3 |  0.2 |  0.2
70 |   1 |  0.1 |  0.1
Table 2.9 and Table 2.10 show that the numbers of association rules of the three trials (uncompressed DB, compressed DB without quantification table, compressed
DB with quantification table) are the same. In Table 2.9 and Table 2.10, comparing each rule of the three methods shows that the support and confidence of each rule differ, but not significantly.
Table 2.9: ARs obtained with support 85% and confidence 80%
No. | Association rules | Support | Confidence
Uncompressed DB
1 | { LL_A5 } ==> { LH_A2 } | 86% | 97%
2 | { LH_A2 } ==> { LL_A5 } | 86% | 93%
3 | { LL_A5 } ==> { VH_A1 } | 88% | 99%
4 | { VH_A1 } ==> { LL_A5 } | 88% | 91%
5 | { LH_A2 } ==> { VH_A1 } | 92% | 99%
6 | { VH_A1 } ==> { LH_A2 } | 92% | 95%
7 | { LL_A5, VH_A1 } ==> { LH_A2 } | 85% | 97%
8 | { LH_A2, VH_A1 } ==> { LL_A5 } | 85% | 93%
9 | { LH_A2, LL_A5 } ==> { VH_A1 } | 85% | 100%
Compressed DB
1 | { LL_A5 } ==> { LH_A2 } | 88% | 99%
2 | { LH_A2 } ==> { LL_A5 } | 88% | 95%
3 | { LL_A5 } ==> { VH_A1 } | 88% | 100%
4 | { VH_A1 } ==> { LL_A5 } | 88% | 91%
5 | { LH_A2 } ==> { VH_A1 } | 92% | 100%
6 | { VH_A1 } ==> { LH_A2 } | 92% | 95%
7 | { LL_A5, VH_A1 } ==> { LH_A2 } | 87% | 99%
8 | { LH_A2, VH_A1 } ==> { LL_A5 } | 87% | 95%
9 | { LH_A2, LL_A5 } ==> { VH_A1 } | 87% | 100%
Compressed DB and quantification table
1 | { B3 } ==> { A4 } | 92% | 100%
2 | { A4 } ==> { B3 } | 92% | 95%
3 | { E2 } ==> { A4 } | 88% | 100%
4 | { A4 } ==> { E2 } | 88% | 91%
5 | { E2 } ==> { B3 } | 88% | 99%
6 | { B3 } ==> { E2 } | 88% | 95%
7 | { B3, E2 } ==> { A4 } | 87% | 100%
8 | { A4, E2 } ==> { B3 } | 87% | 99%
9 | { A4, B3 } ==> { E2 } | 87% | 95%
Table 2.10: ARs obtained with support 90% and confidence 80%
No. | Association rules | Support | Confidence
Uncompressed DB
1 | { LH_A2 } ==> { VH_A1 } | 92% | 99%
2 | { VH_A1 } ==> { LH_A2 } | 92% | 95%
Compressed DB
1 | { LH_A2 } ==> { VH_A1 } | 92% | 100%
2 | { VH_A1 } ==> { LH_A2 } | 92% | 95%
Compressed DB and quantification table
1 | { B3 } ==> { A4 } | 92% | 100%
2 | { A4 } ==> { B3 } | 92% | 95%
2.6. Conclusion
In this chapter, the thesis studies HA and develops the transaction DB compression algorithm used for mining fuzzy ARs. With this approach, close transactions are combined to form new transactions, reducing the size of the input DB. The compression algorithm was tested on the DBs FAM95 and STULONG. The experimental results with the two DBs showed that the proposed compression method gives faster results than the method proposed in [2], and the values of the frequent itemsets are similar to those obtained with the uncompressed DB. The content of this chapter is published in the works [i, ii].
In this chapter, the thesis uses HA with single-granular representations for the attributes, all with the same parameters. In order to improve the efficiency of AR mining and to find more meaningful rules, in Chapter 3 the thesis studies and proposes methods to optimize the fuzziness parameters to suit each attribute, with single-granular and multi-granular representation.
CHAPTER 3. FUZZY PARTITIONS OF ATTRIBUTES BASED ON GRANULAR REPRESENTATION OF HEDGE ALGEBRA
In this chapter, the thesis presents some ways of dividing fuzzy domains and proposes a method of fuzzy domain division using HA theory based on single-granular and multi-granular representation. HA allows modeling and designing linguistic terms together with their fuzzy-set-based semantics. The thesis proposes algorithms to optimize the MFs based on HA theory for the problem of fuzzy AR mining. The experimental results show that the proposed methods have advantages over some previously proposed methods.
3.1. Partitioning the domain values of the attributes
3.1.1. Introduction
The partition of the value domains of the quantitative attributes of an input data set is stated as follows: suppose there is the definite domain of a given attribute (only quantitative attributes are considered).
Each quantitative attribute has a definite domain (value domain): the domain on the real number axis containing the values that the attribute can take. The requirement is to divide the attribute domain into granules, each granule having a linguistic label represented by a fuzzy set. In the fuzzy set theory approach, the authors divide the attribute value domain into fuzzy sets and adjust the parameters of the fuzzy sets; the assignment of linguistic labels to the fuzzy sets is based on the designer's intuition. HA, in contrast, is derived from linguistic awareness; it can therefore design linguistic terms and their semantics based on the associated fuzzy sets.
3.1.2. Discrete quantitative attributes
There are two ways of dividing the definite domain of an attribute: into crisp subdomains or into fuzzy subdomains. The division into crisp subdomains is shown in the following example. If A is a discrete quantitative attribute or a categorical attribute with a finite value set {v1, v2, ..., vk} and k is small, the attribute is transformed into k binary attributes of the form A_v1, A_v2, ..., A_vk. The value of a record in the field A_vi is 1 if the value of that record on the original attribute A is equal to vi; in the remaining cases, the value of A_vi is 0. If A is a continuous quantitative attribute, or A is a discrete quantitative attribute or a categorical attribute with a value domain {v1, v2, ..., vp} where p is large, we map A into q binary attributes <A: start1..end1>, <A: start2..end2>, ..., <
A: start_q..end_q>. The value of a record in field <A: start_i..end_i> is 1 if the record's value of the original attribute A lies in [start_i, end_i], and 0 otherwise.

In mining fuzzy ARs, we need to divide the value domain of each attribute into fuzzy domains, each usually associated with an MF and a linguistic label. The division into fuzzy subdomains has more advantages and is the approach the thesis takes; it is elaborated in section 3.1.3.

3.1.3. Dividing the value domain of the attributes based on the fuzzy-set approach

Some common methods of fuzzy domain division are:

a) Random partitioning: a fixed number of subdomains (partitions) is chosen and the item's domain is divided into equal areas. This method is simple and effective when there is not enough information about the DB.

b) Partitioning by clustering: a clustering method is used to determine the fuzzy sets; this method takes the distribution of the data into account.

c) Dynamically constrained partitioning: the fuzzy domain division builds MFs for the fuzzy domains. Each MF usually has parameters that adjust the membership of values in the fuzzy domain. Optimizing the parameters of the MFs plays an important role in mining fuzzy ARs, so some studies use evolutionary algorithms to improve them.

3.2. Method of fuzzy partitioning by granular representation on HA

In this section, the thesis presents a method of partitioning the value domains of quantitative attributes based on the HA approach, using single-granular and multi-granular representations of the data. HA provides a sound mathematical structure built on the defined domain of each attribute: it yields not only a simple partition of the domain but also attaches the semantics of each fuzzy subdomain to the linguistic label it represents, always preserving the natural order of the linguistic labels.
Moreover, according to the thesis, a partition based on HA is always a strong one. With this approach, the mined ARs reflect the knowledge hidden in the data better and more diversely, from highly generalized knowledge to highly specific, detailed knowledge, meeting the needs of the manager.

3.2.1. Partitioning of attribute domain values using single-granular representation

From the results on the fuzzy intervals of the elements of an HA in section 1.2.4, there is a way to compute the degree of membership of any value in a numerical DB in the fuzzy sets used to subdivide the fuzzy domain [25, 26]. On the defined domain of an item (which may be normalized to [0, 1]), any value lies between the quantified semantic values of two consecutive fuzzy domains, or coincides with one of them; this follows from the way the quantified semantic values partition the defined domain into fuzzy intervals. Thus, the distances between a value x_ij and the two neighboring quantified semantic values can be used to compute the membership of x_ij in the fuzzy sets represented by those fuzzy intervals (when x_ij coincides with a quantified semantic value, there is only one degree of membership): the smaller the distance, the greater the degree of membership; when they coincide, the membership can be taken as 1. In Figure 3.1, the quantified semantic values divide the defined domain of an attribute into fuzzy domains. For each fuzzy domain, a triangle representing the MF of its fuzzy set is built, with one vertex at (υ(x_i), 1); the other two vertices
are on the defined domain, at coordinates (υ(x_{i−1}), 0) and (υ(x_{i+1}), 0), where υ(x_{i−1}), υ(x_i), υ(x_{i+1}) are three consecutive quantified semantic values (Figure 3.1).

Figure 3.1: Constructing the partition of attribute domain values based on the HA approach

It can be seen that the two constructions are essentially equivalent. Indeed, let E be an arbitrary point on the axis representing the domain of attribute I_i. In the first way, the distances Eυ(x_2) and Eυ(x_3) are used to determine the memberships of E in the fuzzy sets represented by the MF triangles υ(x_1)Bυ(x_3) and υ(x_2)Cυ(x_4), normalized so that the membership always lies in [0, 1]. In the second way, EG and EF are the memberships of E in these two fuzzy sets. Since EG is parallel to υ(x_2)B, we have EG/υ(x_2)B = Eυ(x_3)/υ(x_2)υ(x_3). Similarly, EF/υ(x_3)C = υ(x_2)E/υ(x_2)υ(x_3). Moreover, υ(x_2)B = υ(x_3)C = 1, so finally EF/EG = υ(x_2)E/Eυ(x_3). Hence the two ways of assigning degrees of membership are equivalent. This also shows that the HA way of assigning degrees of membership is proper in terms of perception.

Constructing MFs, or equivalently fuzzy sets, to divide the defined domain of an attribute based on the HA approach has the following advantages:

- Because the construction is based on human-perceived semantics, the built MFs intuitively reflect the semantics of the fuzzy sets they represent.

- The coverage of the MFs is good (they always cover the defined domain). Hence, to optimize the suitability of the MFs, we only need to optimize their overlap and coverage. The problem of optimizing the HA parameters with respect to overlap and coverage can be solved by a GA.
- Few parameters need to be managed during construction (each MF is a triangle, together with the LBP value); when the initial HA parameters change, the new MFs are easily redefined while the overlap and coverage measures are preserved. This method is simple and reasonable.

Figure 3.2: Partitioning of attribute domain values using single-granular representation

3.2.2. Partitioning of attribute domain values using multi-granular representation

The fuzzy domain subdivision method based on the HA approach using single-granular representation has the advantages presented above, but it still has limitations related to the semantics of the data. Under this scheme, the MFs are all built from terms of the same length. This means the ARs we discover only contain terms of the same length, which reduces the
meaning of the discoverable rules. If we are not very concerned about data semantics and simply divide the defined domain mechanically (as most fuzzy-set approaches do), then the single-granular HA method presented in section 3.2.1 is quite good. However, considering the semantics of the data, which is extremely important for acquiring good knowledge in AR mining, we must take a deeper approach. It is possible to construct semantic fuzzy domains to create the partitions, but this is not very exact because the partitions created are not unique. In this chapter, the thesis chooses an approach based on representing data in a multi-granular structure. With this method, the mined ARs are more informative.

Figure 3.3: Multi-level granular structure

Conceptually, the multi-granular representation, as mentioned, gives us a more diverse view of the input information. The construction, representation and use of a granular structure usually follow the rules of multilevel and multiview. The multilevel rule is the advantage of the granular structure shown in multi-granular understanding and representation. The multiview rule is associated with the objective existence of the data (information granules) and the subjective view of the researcher using the data, whereby at each level of the granular structure the information can be divided in different ways. With granular computing following these two rules, we have a structured view of the data that is both systematic and simpler for solving data mining problems. In addition, and importantly for the HA-based research direction of the thesis, granular computing and the attached multi-granular data representation following the above rules satisfy the requirements of interpretability.
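As a small illustration of the multilevel rule in the HA setting, the sketch below expands one granularity level into the next using the standard HA property that the fuzziness measure distributes over hedges, fm(h x) = μ(h) · fm(x). All parameter values here are assumed for illustration only.

```python
# Hedged sketch: each term x of one granularity level expands into the
# hedge-modified terms "h x" of the next level, with fm(h x) = mu(h) * fm(x).
# Hedge and generator parameters below are hypothetical.

def expand(terms, hedges):
    """One granularity level down: apply every hedge to every term."""
    return {h + " " + t: round(mu_h * fm_t, 6)
            for t, fm_t in terms.items()
            for h, mu_h in hedges.items()}

hedges = {"Little": 0.4, "Very": 0.6}    # mu(Little) + mu(Very) = 1
level1 = {"Low": 0.45, "High": 0.55}     # fm(Low) = w, fm(High) = 1 - w
level2 = expand(level1, hedges)
print(level2)
# {'Little Low': 0.18, 'Very Low': 0.27, 'Little High': 0.22, 'Very High': 0.33}
```

Note that at every level the fuzziness measures still sum to 1, mirroring the partition property of the HA-based construction.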
The requirement is that the division into granules needs to preserve the natural linguistic order (such as "young" < "old").
In contrast, with HA, designing the fuzzy partition of the attribute value domain at different levels of the multi-granular representation is easy, because it follows the very way HA itself is built. In HA theory, for each attribute value domain we only need to define the fuzzy parameter set of the HA; we can then determine the fuzziness interval of every term through the computation formulas, no matter how long the term is (i.e., at whatever level the term lies in the multi-granular representation system). Hierarchical partitioning is one of the main methods of GrC and is inherent in the construction of HA: each term x of length k can be partitioned into the elements h_i x (where the h_i are the hedges of the HA under consideration) of length k + 1. It can be said that HA is a very suitable tool for multi-granular computing. Figure 3.4 shows an example of three granularity levels built from the quantified semantic values of an HA: level 0 has 3 MFs, level 1 has 4 MFs, and level 2 has 6 MFs.

3.3. The optimal method of fuzzy parameters based on HA for AR mining

Figure 3.5: Optimal partition search scheme for attribute domain partitioning and AR mining

To find the optimal MFs for fuzzy AR mining, previous authors have used several criteria to evaluate the MFs of the attributes. Specifically, the suitability of the set of MFs used to divide a linguistic attribute I_q can be assessed through three factors: the overlap factor, which measures the overlap of the MFs with each other; the coverage factor, which measures how well the MFs cover the value domain; and the usage factor. In this section, the quantified semantic values of HA are used to build MFs for numerical attributes and applied to the fuzzy AR mining problem. Instead of optimizing the parameters of the MFs, we optimize the fuzzy parameters of the HA.
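The overlap and coverage factors just mentioned can be sketched as follows for one attribute whose MFs are triangles (left, peak, right) built from consecutive quantified semantic values. Two assumptions are ours, not from the thesis: the overlap of two MFs is taken as the length of the intersection of their supports, and the span terms follow Hong et al.'s genetic-fuzzy formulation.

```python
# Hedged sketch of the per-attribute suitability terms; MF shapes and the
# example domain are hypothetical.

def overlap_factor(mfs):
    """Sum over MF pairs of max(overlap / min(half-spans), 1) - 1;
    0.0 means no excess overlap (a strong partition)."""
    total = 0.0
    for i in range(len(mfs)):
        for j in range(i + 1, len(mfs)):
            (l1, p1, r1), (l2, p2, r2) = mfs[i], mfs[j]
            overlap = max(0.0, min(r1, r2) - max(l1, l2))
            half_span = min(r1 - p1, p2 - l2)  # right span of R_i, left span of R_j
            total += max(overlap / half_span, 1.0) - 1.0
    return total

def coverage_factor(mfs, max_value):
    """1 / (covered range / max attribute value); 1.0 means full coverage."""
    covered = max(r for (_, _, r) in mfs) - min(l for (l, _, _) in mfs)
    return 1.0 / (covered / max_value)

# Triangles built from consecutive semantic values 2, 5, 8 on the domain [0, 10]:
mfs = [(0.0, 2.0, 5.0), (2.0, 5.0, 8.0), (5.0, 8.0, 10.0)]
print(overlap_factor(mfs))         # 0.0  (no excess overlap)
print(coverage_factor(mfs, 10.0))  # 1.0  (the MFs cover the whole domain)
```

With such a partition both terms are already optimal, which matches the earlier observation that HA-built MFs always cover the defined domain.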
As the figure above shows, the scheme for finding MFs and mining ARs consists of two steps:

Step 1: Finding the optimal HA parameters of the attributes. From the HA parameters we can easily build MFs for the attributes as shown in section 3.2 and compute the objective function. At the end of Step 1 we obtain the set of HA parameters, from which the MFs used in Step 2 are easily built.

Step 2: Mining ARs: we use the HA parameters obtained in Step 1 to fuzzify the transaction DB and mine fuzzy ARs. At the end of this step we obtain a set of ARs expressed in linguistic information.

3.3.1. MF codification

To build MFs for the attributes, the thesis uses an HA with the structure
AX = (X, G, H, ≤), where: G = {C− = {Low}} ∪ {C+ = {High}}; H = {H− = {Little}} ∪ {H+ = {Very}}; α = μ(Little) = 1 − μ(Very), β = μ(Very); w = fm(Low) = 1 − fm(High).

With this HA structure there are four parameters: μ(Little), μ(Very), fm(C−), fm(C+). Since μ(Very) = 1 − μ(Little) = 1 − α and fm(High) = 1 − fm(Low) = 1 − w, for each HA we only need to find the two parameters α and w instead of all four. Based on the HA parameters of the attributes, we construct the MFs in single-granular form as presented in section 3.2.1 or in multi-granular form as presented in section 3.2.2.

We need to find the fuzzy parameters of the HAs AX_i for the n quantitative attributes; each HA has two parameters α_i, w_i (i = 1, …, n). Thus, a chromosome is represented by an array of real numbers of size 2n, with the following gene structure:

(α_1, …, α_n, w_1, …, w_n)    (3.1)

3.3.2. Chromosome evaluation

The objective function of a chromosome C_q is defined as:

fitness(C_q) = (Σ_{x∈L1} fuzzy_support(x)) / suitability(C_q)    (3.2)

where L_1 is the set of frequent 1-itemsets obtained using the MFs in C_q, fuzzy_support(x) is the fuzzy support of the 1-itemset x computed from the transaction DB, and suitability(C_q) is the suitability of the set of MFs in C_q, defined as:

suitability(C_q) = Σ_{k=1}^{n} [overlap_factor(C_qk) + coverage_factor(C_qk)]    (3.3)

where n is the number of items, overlap_factor(C_qk) is the overlap factor of the MFs of item I_k in chromosome C_q, and coverage_factor(C_qk) is the coverage factor of the MFs of item I_k in chromosome C_q. The overlap factor represents the degree of overlap between the MFs in chromosome C_q.
This ratio is defined as:

overlap_factor(C_qk) = Σ_{i=1}^{m} Σ_{j=i+1}^{m} [max(overlap(R_i, R_j) / min(span_R(R_i), span_L(R_j)), 1) − 1]    (3.4)

where overlap(R_i, R_j) is the overlap of the MFs R_i and R_j, span_R(R_i) is the right span of R_i, and span_L(R_j) is the left span of R_j. The coverage factor is the ratio of the coverage of the MFs of an item I_k in chromosome C_q, defined as:

coverage_factor(C_qk) = 1 / (range(R_1, …, R_m) / max(I_k))    (3.5)

where range(R_1, …, R_m) is the coverage range of the MFs and max(I_k) is the maximum value of I_k in the transactions.

3.4. The algorithm for mining the optimal fuzzy partition and ARs

The algorithm consists of two phases:

Phase 1: Finding the optimal fuzzy partition based on the input transaction DB.

Phase 2: Mining fuzzy ARs using the MFs obtained in Phase 1.

Algorithm content:
Input: a transaction DB with T transactions and n items (each item has m linguistic terms), the minimum support min_supp, the minimum confidence min_conf, and the population size N.

Output: a set of fuzzy ARs with the associated set of MFs.

Phase 1: Finding the optimal fuzzy partition based on the transaction DB

Step 1: Generate an initial population of N random chromosomes. A chromosome is represented as (α_1, …, α_n, w_1, …, w_n), where each pair (α_i, w_i), i = 1, …, n, defines one HA.

Step 2: Encode the MFs into coded strings as described in section 3.3.1. Based on the HAs from Step 1, construct MFs for the attributes of the original DB as described in section 3.2; either the single-granular or the multi-granular representation can be used.

Step 3: Compute the fitness of each chromosome in the population as follows:

Step 3.1: For each transaction D_i, i = 1, …, T, and each item I_j, j = 1, …, n, convert the quantitative value v_j^(i) into a fuzzy set (f_j1^(i)/R_j1 + f_j2^(i)/R_j2 + … + f_jl^(i)/R_jl) using the MFs of the chromosome, where R_jk is the k-th linguistic term of item I_j, f_jk^(i) is the membership degree of v_j^(i) in R_jk, and l is the number of fuzzy domains of I_j.

Step 3.2: Compute the fuzzy count of each fuzzy domain:

count_jk = Σ_{i=1}^{T} f_jk^(i)    (3.6)

Step 3.3: For each fuzzy domain R_jk, 1 ≤ j ≤ n, 1 ≤ k ≤ |I_j|, check whether count_jk is larger than or equal to the minimum support threshold min_supp. If R_jk satisfies this condition, put it into the set of frequent 1-itemsets L_1:

L_1 = {R_jk | count_jk ≥ min_supp, 1 ≤ j ≤ n, 1 ≤ k ≤ |I_j|}

Step 3.4: Compute the objective value of the chromosome:

fitness(C_q) = (Σ_{x∈L1} fuzzy_support(x)) / suitability(C_q)    (3.7)

Step 4: Perform crossover on the population.

Step 5: Use conditional selection to select the individuals of the next generation.

Step 6: If the best chromosome still changes, go back to Step 3; otherwise, go to the next step.
Step 7: The MFs are taken from the individual with the largest objective value in the population.

Phase 2: Fuzzy AR mining: apply the fuzzy AR mining algorithm with the obtained MFs.

3.5. Experimental results

This section describes the DBs used in the tests and the experimental results of the two proposed methods, using single-granular and multi-granular data representation. The GA parameters are as follows: the population size is 50, the number of generations is 10000, the number of bits per gene is 30, and the crossover probability is 0.6.
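The core of Step 3 above (fuzzify each value, accumulate the fuzzy counts, threshold into L_1, divide by suitability) can be sketched as follows. The triangular MFs, the transactions, the thresholds and the precomputed suitability value are all hypothetical illustrations, not data from the thesis.

```python
# Hedged sketch of Steps 3.1-3.4 of Phase 1, with suitability passed in as a
# precomputed number (see formulas (3.4)-(3.5)).

def tri(v, left, peak, right):
    """Triangular membership degree of value v."""
    if v == peak:
        return 1.0
    if left < v < peak:
        return (v - left) / (peak - left)
    if peak < v < right:
        return (right - v) / (right - peak)
    return 0.0

def fitness(transactions, regions, min_supp, suitability):
    """regions: {(item, label): (left, peak, right)} triangular MFs;
    transactions: list of {item: quantitative value}."""
    counts = {key: 0.0 for key in regions}
    for t in transactions:                                   # Steps 3.1 + 3.2
        for (item, label), (l, p, r) in regions.items():
            if item in t:
                counts[(item, label)] += tri(t[item], l, p, r)
    L1 = {k for k, c in counts.items() if c >= min_supp}     # Step 3.3
    return sum(counts[k] for k in L1) / suitability, L1      # Step 3.4

regions = {("age", "Low"): (0, 25, 50), ("age", "High"): (25, 50, 75)}
transactions = [{"age": 30}, {"age": 45}, {"age": 50}]
value, L1 = fitness(transactions, regions, min_supp=1.5, suitability=2.0)
print(L1, value)   # {('age', 'High')} 1.0   (count 2.0 / suitability 2.0)
```

In the real algorithm the regions would be rebuilt from each chromosome's (α_i, w_i) pairs before this evaluation, and the GA loop of Steps 4-6 would repeat it per generation.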