Summary of computer and Information technology doctoral thesis: Some hybrid methods in fuzzy rough set based attribute reduction in decision tables

Chia sẻ: _ _ | Ngày: | Loại File: PDF | Số trang:26

Thêm vào BST

Báo xấu

36
lượt xem 4
download

Download Vui lòng tải xuống để xem tài liệu đầy đủ

The first is the hybrid filterwrapper algorithm proposal for finding the reduction set of the decision table using advanced fuzzy distance measurements and other measurements basing on fuzzy rough set to minimize2 the number of attributes of reduction set. Reduce and improve the accuracy of the classification model.

Chủ đề:

Bình luận(0) Đăng nhập để gửi bình luận!

Lưu

Nội dung Text: Summary of computer and Information technology doctoral thesis: Some hybrid methods in fuzzy rough set based attribute reduction in decision tables

MINISTRY OF EDUCATION VIETNAM ACADEMY OF AND TRAINING SCIENCE AND TECHNOLOGY GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY ------------------------------- NGUYEN VAN THIEN SOME HYBRID METHODS IN FUZZY ROUGH SET BASED ATTRIBUTE REDUCTION IN DECISION TABLES Major: Information Systems Code: 9 48 01 04 SUMMARY OF COMPUTER AND INFORMATION TECHNOLOGY DOCTORAL THESIS Ha noi, 2018
List of works of author 1 Nguyen Van Thien, Nguyen Long Giang, Nguyen Nhu Son, “Attribute reduction method in decision table with attribute value domain receiving numerical value based on fuzzy rough set”, Journal on works of research, development and application of Information Technology & Communication, Journal of Science and Technology of Ministry of Information and Communications, Episode V-2, no. 16 (36), 12-2016, pp. 40-49. 2 Nguyen Van Thien, Janos Demetrovics, Vu Duc Thi, Nguyen Long Giang, Nguyen Nhu Son, “A Method to Construct an Extension of Fuzzy Information Granularity Based on Fuzzy Distance”, Serdica Journal of Computing 10 (2016), Sofia, Bulgarian Academy of Sciences, No 1, 2016, pp. 13-30. 3 Nguyen Long Giang, Nguyen Van Thien, Cao Chinh Nghia, “An attribute reduction method in decision table with continuous value domain based on fuzzy rough set approach”, Proceedings of 18th National Conference: Selected Problems of Information Technology & Communication - TP HCM, 05-06/11/2015. 4 Nguyen Van Thien, Nguyen Nhu Son, Nguyen Long Giang, Cao Chinh Nghia, “On the Constructing of Extended Fuzzy Information Granularity based on Fuzzy Distance”, Proceedings of 18th National Conference: Selected Problems of Information Technology & Communication, Hanoi, 01- 02/10/2016, Page 371-376. 5 Nguyen Long Giang, Nguyen Van Thien, Cao Chinh Nghia, “About a Direct Attribute Reduction Method on Decision Tables using Fuzzy Distance”, Proceeding of the 9th National Conference on Fundamental and Applied research of Information Technology (FAIR’9), Can Tho, 04-05/08/2016, pp. 825-835. 6 Nguyen Van Thien, Nguyen Long Giang, Nguyen Nhu Son , “Fuzzy Partition Distance based Attribute Reduction in Decision Tables”, IJCRS'2018: International Joint Conference on Rough Sets 2018, Quy Nhon, Viet Nam, August 20-24, 2018 (Accepted) 7 Nguyen Van Thien, Nguyen Long Giang, Nguyen Nhu Son, “An Incremental Attribute Reduction Method in Decision Tables using Fuzzy Distance”, Proceedings of 21th National Conference: Selected Problems of Information Technology & Communication, Thanh Hoa, 27-28/07/2018, pp. 296- 302.
INTRODUCTION Fuzzy rough set theory proposed by Dubois et al. [22, 23] is a combination of rough set theory and fuzzy set theory to approximate fuzzy sets based on a fuzzy equivalent relation defined on the attribute value region. Since its appearance, fuzzy rough set theory is an effective tool for solving directly attribute reduction problem on the original decision table (decision table not to through the data discretization step) to improve accuracy of the classification model. Researches related to the attribute reduction based on approaching fuzzy rough set have been quite exciting in the past few years, including main methods such as the method using fuzzy positive region [2, 72, 80, 92], the method using the fuzzy distinction matrix [34, 42, 29, 30, 69], the method using fuzzy entropy [45, 70, 71, 74, 91, 75, 33, 55], the method using fuzzy distance [3, 8, 18]. More recently, some researchers have proposed extended methods based on different measurements defined [14, 19, 21, 30, 33, 35, 46, 47, 59, 68, 85, 90, 100]. However, as with conventional methods of attribute reduction based on approaching rough set, most of attribute reduction based on approaching fuzzy rough set are filter methods, which means that the accuracy of the classification model is evaluated after finding the reduction set. The reduction set obtained only satisfies the conditions of measurement preservation without having the highest classification accuracy. Therefore, the reduction set of filter methods has not optimized the number of attributes and classification accuracy. Nowadays, decision tables are often large and always changed, updated. The algorithms application for finding reduction set based on approaching conventional rough set and extended rough set models have challenges. In case the decision tables are changed, these algorithms have to recalculate the reduction set on the whole decision table after changing so that the cost of calculation time increases significantly. In the case the decision table has large size, implementation of the algorithm on the whole decision table will have difficulty in implementation time. Thus, the researchers propose a more calculation approach direction to increase the finding of reduction set. Incremental algorithms are capable of minimizing implementation time and being implemented on large decision tables by dividing the decision table into parts. Basing on approaching conventional rough set and tolerance rough set, the researches related to incremental algorithms for finding reduction set of decision table is changed excitingly and focused on the cases: additions and removal of objects [36, 37, 38, 49, 56, 66, 86, 95, 96, 102], additions and removal of attributes [31, 38, 49, 54, 86, 87, 88, 89]. Using distance measurements, the authors in [24, 65] have developed formulas that increase of distance calculation, according to the basis to development the incremental algorithm for finding reductions set in the case of additions and removal of object set and additions and removal of attribute set. In recent years, some research teams have proposed incremental algorithms for finding reduction set on decision table based on approaching fuzzy rough set in the following cases: additions and removal of attribute set [15, 16 ], additions of object set [97, 98, 99]. Incremental algorithms for finding above mentioned reduction set based on approaching fuzzy rough set have significantly smaller implementation time than non- incremental algorithms and can be implemented on large data tables. However, the above mentioned algorithms are based on the conventional filter approach direction. Thus, basing on approaching fuzzy rough set, the thesis researches the filter-wrapper incremental algorithm to find the approximated reduction set to minimize the number of attributes of the reduction set and improve the accuracy of the classification model. The objective of the thesis is to focus on two main topics. The first is the hybrid filter- wrapper algorithm proposal for finding the reduction set of the decision table using advanced fuzzy distance measurements and other measurements basing on fuzzy rough set to minimize 1
the number of attributes of reduction set. Reduce and improve the accuracy of the classification model. The second is the filter-wrapper incremental algorithm proposal for finding the reduction set of the decision table changes the use of the fuzzy distance measurement based on approaching fuzzy rough set to minimize the implementation time and improve the accuracy compared to other incremental algorithms. With the objectives set out, the thesis has two main results as follows: 1) Propose two filter-wrapper algorithms for finding the reduction set of decision tables based on approaching fuzzy rough set: Algorithm using fuzzy function and algorithm using fuzzy distance. The fuzzy distance measurement implemented is an extension of the distance measurement in the work [48]. These contributions are presented in Chapter 2 of the thesis and published in Works 1, 2, 5, 6, 7. 2) Propose two filter-wrapper incremental algorithms for finding the reduction set of decision tables in the case of addition of object sets and the removal of object sets using fuzzy distance measurements implemented. These contributions are presented in Chapter 3 of the thesis and published in Work 8. The layout of the thesis consists of the introduction and the three chapters of content, the conclusions and the references. Chapter 1 presents the basic concepts of conventional rough set theory, model of the fuzzy rough set, and the overview of the filter-wrapper approach in attribute reduction. Chapter 1 also presents researches related to the attribute reduction based on approaching fuzzy rough set, the incremental method of attribute reduction in recent years. Chapter 2 presents two research results: the first is to propose filter-wrapper algorithms for finding the reduction set using a fuzzy function; the second is to implement a fuzzy distance measurement and propose filter-wrapper algorithms for finding the reduction set using the fuzzy distance implemented. Both proposals aim to minimize the number of attributes of reduction set and improve the accuracy of classification model compared to the previous filter methods. Chapter 3 proposes two filter-wrapper incremental algorithms; The first is filter-wrapper incremental algorithm for finding reduction set of the decision table in the case of addition of attribute set; The second is filter-wrapper incremental algorithm for finding reduction set of the decision table in the case of removal of attribute set. Finally, the conclusion states the contribution of the thesis, the direction of development and the concerned problem of the author. Chapter1. OVERVIEW 1.1. Concepts in theory of fuzzy rough set 1.1.1. Fuzzy equivalent relation Definition 1.1. [32, 71] Given decision table DS  U , C  D  , a relation R defined on attribute value region called as fuzzy equivalent relation if the following conditions are satisfied with all x, y, z U 1) Reflexive: R  x, x   1 ; 2) Symetric: R  x, y   R  y, x  ; 3) Max-min transitive: R  x, z   min R  x, y  , R  y, z  ; with R  x, y  is value of relation between x and y. 2
Proposition 1.1. [72] Given decision table DS  U , C  D  and fuzzy equivalent relation R . Symbols R P , R Q respectively as relations R defined on attribute set P, Q. Then, with all x, y U we have : 1) R P  RQ  R P  x, y   RQ  x, y   2) R PQ  R P  RQ  R  x, y   max R P  x, y  , RQ  x, y   3) R PQ  R P  RQ  R  x, y   min R P  x, y  , RQ  x, y  4) R P  RQ  R P  x, y   RQ  x, y  1.1.2. Fuzzy equivalent relational matrix Definition 1.2. Given decision table DS  U , C  D  with U  x1 , x2 ,..., xn  and R P is fuzzy equivalent relation defined on the attribute set P  C . Then, Fuzzy equivalent relational matrix shows R P , symbolized as M  R P    pij nn , is defined as below:  p11 p12 ... p1n  p p22 ... p2 n  M ( R P )   21   ... ... ... ...     pn1 pn 2 ... pnn  with pij  R P  xi , x j  is value of relation between xi and x j on attribute set P, pij  0,1 , xi , x j U ,1  i, j  n . So, value of elements of fuzzy equivalent relational matrix M  R P  depends on fuzzy equivalent relation R P selected 1.1.3. Fuzzy partition Definition 1.3. Given decision table DS  U , C  D  with P  C , U  x1 , x2 ,..., xn  and R P is fuzzy equivalent relation on P. Then, fuzzy partition on U derived by R P , symbolized as   R P  , is defined as below    R P  U / R P   xi P i 1   x1 P ,...,  xn P  n (1.8) with  xi P  pi1 / x1  pi 2 / x2  ...  pin / xn is a fuzzy set playing the role of fuzzy equivalent class of xi U . With fuzzy equivalent class  xi P , function of x j U defined by  x   x j   R  xi , x j   R P  xi , x j   pij and the number of fuzzy equivalent class  xi P calculated i P P n by  xi P   pij . j 1 1.1.4. Fuzzy approximation sets and fuzzy positive region Definition 1.4. [66, 70, 85, 87] Given X is fuzzy set on U and R P is fuzzy equivalent relation on attribute set P  C . Then, fuzzy lower approximation set R P X and fuzzy upper approximation set R P X of X is fuzzy sets and functions of x U defined as below: R PX  x  FU / R P  sup min F  x  ,inf max 1  F  y  ,  X  y  yU  (1.9) 3
  R sup min  F  x  ,sup min F  y  ,  X  y   x  (1.10) PX FU / R P  yU  With symbols inf, sup respectively as infimum and supremum of set X; F is fuzzy equivalent classes of fuzzy partition U / R P . With sets of fuzzy lower approximation and fuzzy upper approximation defined by Definition 1.6, set R P X , R P X called as fuzzy rough set. Definition 1.5 [66] Given decision table DS  U , C  D  and R P , RQ respectively as two fuzzy equivalent relations defined on P, Q  C . Then, fuzzy positive region of R Q to R P , symbolized as POSR  RQ  , is a fuzzy set which function of x U defined as below: P POS  R   x   sup R Q PX  x (1.11) RP X U / R Q 1.2. Attribute reduction 1.2.1. Overview Attribution is an important problem in the data preprocessing step with the objective of removing redundant, unrelated attributes to increase the efficiency of data mining algorithms: Increase the speed, improve the quality and comprehensibility of the results. Attribute reduction techniques are often classified into two types: Attribute selection and Attribute transformation. In this thesis, we research the approach direction of attribute selection, referred to as the attribute reduction. 1.2.2. Filter, wrapper approaches in attribute reduction There are currently two main approaches to the attribute reduction [43, 44]: filter and wrapper. The filter approach implements independent attribute reduction of the data mining technique used in the future. So far, most attribute reduction methods based on rough set theory and extensions are in this approach. The wrapper approach conducts the selection by applying the mining technique immediately, and the accuracy of the result is taken as the standard for selecting the attribute subset. Filter approach has the advantage of fast calculation time, the disadvantage is not to use the class label information of data set, so the accuracy is not high. Attribute set Subset Algorithm selected Selected Attribute subset Attribute set Subset set out Algorithm Evaluation Figure 1.2. Filter and wrapper approaches in attribute reduction In order to combine the advantages of both filter and wrapper approaches, a number of new approaches have been proposed by the authors, such as the hybrid filter-wrapper approach [67, 91]. 4
1.3. Related works to attribute reduction based on approaching fuzzy rough set 1.3.1. Related works Up to now, researches related to direct attribute reduction on the original decision table based on approaching fuzzy rough set focus on key methods such as: the use method of fuzzy positive region [2, 72, 80, 92], the use method of a fuzzy distinction matrix [34, 42, 29, 30, 69], use method of the fuzzy entropy [45, 70, 71, 74, 91, 75, 33, 55]; the use method of a fuzzy distance [3, 8, 18]. More recently, some researchers have proposed extended methods based on different measurements defined [14, 19, 21, 30, 33, 35, 46, 47, 59, 68, 85, 90, 100]. The test results on the set of sample data show that the attribute reduction methods based on approaching fuzzy rough set has higher classification accuracy than the attribute reduction methods based on approaching conventional rough set. 1.3.2. Problems As with the attribute reduction methods based on approaching rough set, most of the attribute reduction method based on approaching fuzzy rough set published is as heuristic method based on approaching filter. That is, classification accuracy is evaluated after finding a reduction set. 1) The reduction set of methods based on approaching above mentioned filter has not optimized the number of attributes and accuracy of classification. 1.4. Related works on incremental method for finding reduction set based on approaching fuzzy rough set 1.4.1. Related works on incremental method for finding reduction set based on approaching conventional fuzzy rough set and tolerance rough set Based on approaching conventional fuzzy rough set and tolerance rough set, the research related to incremental algorithms for finding reduction set of decision table is changed excitingly and focused on the cases: addition and removal of objects [36, 37, 38, 49, 56, 66, 86, 95, 96, 102], addition and removal of attributes [31, 38, 49, 54, 86, 87, 88, 89]. Using distance measurements, the authors in [24, 65] have developed formulas of increasing distance calculation, on that basis to develop the incremental algorithm for finding reduction set in the case of addition, removal of object set and addition, removal of attribute set. 1.4.2. Related works on incremental methods for finding reduction set based on approaching conventional fuzzy rough set In recent years, some research teams have proposed incremental algorithms for finding reduction set on the decision table changing based on approaching fuzzy rough set. Zeng et al. [15] developed an incremental algorithm for finding reduction set uusing fuzzy function in the case of addition and removal of an attribute (FRSA-IFS-HIS-AA and FRSA-IFS- HIS-AD). In the case of addition of object set, Yang et al. [98] developed the IARM algorithm for finding reduction set using distinction relations. Yang et al. [99] proposed two versions of incremental the algorithm for finding reduction set in the case of addition of object set: V-FS-FRS-1 and V- FS-FRS-2 algorithms. Liu et al. [97] developed formulas to calculate the increase of fuzzy function in the case of addition of object sets, on that basis to develop incremental algorithm for finding reduction set using FIAR fuzzy function. 1.4.3. Problems 1) The incremental algorithms for finding reduction set based on approaching fuzzy rough set have significantly smaller implementation time than non-incremental algorithms and can be implemented on large data tables. However, the above mentioned algorithms follow the conventional filter approach. Thus, the reduction set of algorithms mentioned above is not optimal both in terms of number of attributes and accuracy of classification. 5
2) The researches related to the incremental method described in Item 1.4.3.2 have resolved the attribute reduction problem in the case of addition of object set, addition and removal of attribute set, updating the attribute set, have not resolved problem of removal of object set. Chapter 2. ATTRIBUTE REDUCTION IN DECISION TABLE USING FUZZY FUNCTION AND FUZZY DISTANCE 2.1. Introduction In this chapter, the thesis proposes two algorithms based on approaching the hybrid filter-wrapper for finding an approximate reduction set to minimize the number of attributes of reduction set and improve the accuracy of the classification model. The filter phase finds the candidates for the reduction set based on the measurement (also called the approximation set), the wrapper phase calculates the candidate's classification accuracy and selects approximate reduction set having the highest classification accuracy. (1) The filter-wrapper algorithm for finding reduction set using the fuzzy function in the fuzzy rough set. (2) The filter-wrapper algorithm for finding reduction set using fuzzy distance. The fuzzy distance constructed is the extension of the partition distance in the work [48] and different to distance measurements in the works [3, 8, 18]. The results in this chapter are published in works 1, 2, 5, 6, 7. 2.2. Attribute reduction using fuzzy function 2.2.1. Attribute reduction using fuzzy function based on filter approach 1) Fuzzy function in fuzzy rough set Given decision table DS  U , C  D  with U  u1 ,..., un , C  c1,..., cm  . With P  C , suppose that R P is fuzzy equivalent relation defined on attribute value region P. Fuzzy function of P based on relation R P is defined in fuzzy rough set as below [77, 78] POS D  x   xU POS D  x  R  D  RP  RP P U U 2) Heuristic algorithm for finding reduction set using the fuzzy dependence of attribute based on approaching filter. F_FRSAR algorithm (Filter Fuzzy Rough Set based Attribute Reduction). Input: Decision table DS  U , C  D  , fuzzy equivalent relation R defined on conditional attribute value region. Output: Reduction set B of DS 1. B :  ;    D  : 0 ; 2. Calculate fuzzy equivalent relational matrix M RC ;   3. Calculate fuzzy function  RC  D  ; // add gradually the greatest important attributes into B 4. While  RB  D    RC  D  do 5. Begin 6. With each a  C  B calculate SIGB  a    R B a  D   R  D ; B 6
7. Select am  C  B so that SIGB  am   Max SIGB  a  ; aC  B 8. B  B  am  ; 9. Calculate  R B  D  ; 10. End; // Remove redundant attributes in B if any 11. For each a  B 12. Begin 13. Calculate  R B a  D  ; 14. If  R B a  D    R  D  then C B : B  a ; 15. End; 16. Return B; Complexity of F_FRSAR algorithm is O C U  2 2  2.2.2. Attribute reduction using fuzzy function based on filter-wrapper approach Consider decision table DS  U , C  D  with C  a1, a2 ,..., am  and R is fuzzy equivalent relation defined on attribute value region. Set    R  D  . According to F_FRSAR algorithm, C suppose that ai , ai ,... added into empty set based on the greatest value of attribute importance 1 2 until exist t 1,2,...m so that  R ai , ai ,..., ai   D    . Finish filter F_FRSAR algorithm, we have 1 2 t reduction set B  ai , ai ,..., ai  and classification accuracy on data set calculated on B. 1 2 t On the other hand, according to the definition of fuzzy positive region in theory of fuzzy rough set and [76, 77, 78, 79] we have  R   D    R   D   ...   R   D    . With threshold ai ai , ai ai ,...,ai 1 1 2 1 t    given, set Bk  ai ,..., ai  to satisfy   D    and     D    . Then, B is called as 1 k R Bk R Bk  aik 1 k approximate reduction set  . If B and B  a ,..., a  used to develop classifier, publication k k ik 1 it [91] shows that, classification accuracy on B  a ,..., a  is not better than on B . Suppose k ik 1 it k that B has better classification accuracy than B  a ,..., a  . Then, If select B to be result k k ik 1 it k of algorithm, Bk shall have higher classification accuracy, has the number of attributes smaller, so the generalization and performance of the algorithm will be higher. This leads to the direction of approaching hybrid for finding approximate reduction set, is combination of filter and wrapper. Filter method find out approximate reduction sets, wrapper method check the classification accuracy of approximate reduction sets to select the highest accuracy reduction set. By this approach, the classification accuracy on the reduction set found is higher than conventional filter method. However, implementation time will be higher because of implementing classifiers. Filter-wrapper algorithm for finding approximate reduction set using fuzzy function as below: FW_FRSAR algorithm (Filter-Wrapper Fuzzy Rough Set based Attribute Reduction): Filter- wrapper algorithm for finding approximate reduction set using fuzzy function. Input: Decision table DS  U , C  D  , fuzzy equivalent relation R defined on conditional attribute value region. 7
Output: Approximate reduction set Bx has the best classification accuracy. // Initialize 1. B :  ;    D   0 ; 2. Calculate fuzzy function  RC  D  ; // Filter phase, finds candidates for reduction set // Add gradually the greatest important attributes into P 3. While  R  D    R  D  do B C 4. Begin 5. With each a  C  B calculate SIGB  a    R B a  D   R  D B 6. Select am  C  B so that SIGB  am   Max SIGB  a  ; aC  B 7. B  B  am ; 8. End; // Wrapper phase, finds the greatest classification accuracy of reduction Set t  B //t is number of elements of B, B includes attribute string selected at each  iterative step of While loop, it means that B  ai ,ai , ai ,...,ai , ai ,..., ai  ; 1 1 2 1 2 t  9. Set B1  ai , B2  ai , ai ,..., Bt  ai , ai ,..., ai  1 1 2 1 2 t 10. For j = 1 to t 11. Begin 12. Calculate the classification accuracy B j by a classifier using 10-fold method; 13. End 14. Bx  B jo with B jo has the greatest classification accuracy. Return Bx ; 2.2.3. Experimental results 1) Data set Table 2.2. Data set of F_FRSAR, FW_FRSAR algorithms No. of conditional attributes No. of No. of No. Data Set Description Real- class objects All Nominal valued decided attribute attribute 1 Ecoli Protein Localization 336 7 0 7 8 Sites 2 Ionosphere Johns Hopkins 351 34 0 34 2 University Ionosphere database 3 WDBC Wisconsin diagnostic 569 30 0 30 2 breast cancer 4 Wpbc Wisconsin 198 33 0 33 2 Prognostic Breast Cancer 8
5 Wine Wine recognition 178 13 0 13 3 data 6 Glass Glass Identification 214 9 0 9 7 Database 7 Magic04 MAGIC gamma 19020 10 0 10 2 telescope data 2004 8 Page- Blocks Classification 5473 10 0 10 5 blocks 2) Evaluate the classification accuracy of F_FRSAR filter algorithm with other algorithms based on approaching fuzzy rough set Table 2.4. The classification accuracy of GAIN_RATIO_AS_FRS and F_RSAR GAIN_RATIO_AS_FRS F_FRSAR algorithm algorithm [45] Data U C No. R SVM C4.5 R SVM C4.5 set classifica classifica classifica classifica tion tion tion tion accuracy accuracy accuracy accuracy 1 Ecoli 336 7 6 0.814 0.802 7 0.865 0.855 2 Ionos 351 34 13 0.916 0.904 15 0.937 0.915 phere 3 Wdbc 569 30 17 0.925 0.917 19 0.980 0.975 4 Wpbc 198 33 17 0.815 0.804 19 0.825 0.818 5 Wine 178 13 9 0.910 0.902 10 0.955 0.920 6 Glass 214 9 7 0.891 0.882 7 0.891 0.882 Magi 190 7 10 6 0.782 0.765 6 0.782 0.765 c04 20 Page- 547 8 block 10 6 0.852 0.848 7 0.865 0.855 3 s The classification accuracy of F_FRSAR is higher than the classification accuracy of GAIN_RATIO_AS_FRS in [45]. F_FRSAR reduction set preserves the fuzzy positive region and more attributes than GAIN_RATIO_AS_FRS algorithm in [45]. 3) Evaluate the classification accuracy of FW_FRSAR filter-wrapper algorithm with F_FRSAR filter algorithm and other filter algorithms based on approaching fuzzy rough set Table 2.5. The classification accuracy of FW_FRSAR, F_FRSAR, GAIN_RATIO_AS_FRS Initial FW_FRSAR F_FRSAR GAIN_RATIO data set algorithm algorithm _AS_FRS No. Data set algorithm [45] U C R Classification R Classification R Classification accuracy accuracy accuracy 9
1 Ecoli 336 7 5 0.901 7 0.855 6 0.802 2 Ionosphere 351 34 8 0.946 15 0.915 13 0.904 3 Wdbc 569 30 6 0.975 19 0.975 17 0.917 4 Wpbc 198 33 12 0.867 19 0.818 17 0.804 5 Wine 178 13 5 0.920 10 0.920 9 0.902 6 Glass 214 9 4 0.924 7 0.882 7 0.882 7 Magic04 19020 10 4 0.886 6 0.765 6 0.765 Page- 8 5473 10 5 0.906 7 0.855 6 0.848 blocks Table 2.5 shows that the number of attributes of reduction set of FW_FRSAR filter- wrapper algorithm is much smaller, especially for Wdbc, Ionosphere data sets. Furthermore, the accuracy of FW_FPDBAR is higher than F_DBAR and GAIN_RATIO_AS_FR. 4) Compare the implementation time of FW_FRSAR, F_FRSAR and GAIN_RATIO_AS_FRS Table 2.6. Implementation time of FW_FRSAR, F_FRSAR, GAIN_RATIO_AS_FRS FW_FRSAR algorithm F_FRSA GAIN_RATI R O No C Data set U Filer Wrapper Total algorithm _AS_FRS . procedur procedur algorithm e e [45] 1 Ecoli 336 7 2.38 1.24 3.62 2.86 2.95 Ionospher 2 351 34 12.64 6.92 19.56 14.87 15.04 e 3 Wdbc 569 30 22.15 8.74 30.89 24.12 26.08 4 Wpbc 198 33 8.56 6.28 14.84 9.12 9.88 5 Wine 178 13 0.58 1.22 1.80 0.62 0.74 6 Glass 214 9 0.82 0.66 1.48 0.88 1.02 1902 1018.7 7 Magic04 10 894.26 124.49 914.86 948.16 0 5 Page- 8 5473 10 98.64 22.16 120.80 112.76 126.28 blocks Table 2.6 shows that the implementation time of the FW_FRSAR algorithm is higher than the two F_FRSAR and GAIN_RATIO_AS_FRS filter algorithms because they must implement the classifiers in the wrapper phase. 2.3. Attribute reduction using fuzzy distance In recent years, Nguyen Long Giang et al. have used distance measures to resolve the problem of attribute reduction in the decision table based on approaching conventional rough set [9, 24, 57]. , 65] and inadequate decision table based on approaching tolerance rough set [9, 10, 12, 25, 58]. Based on approaching fuzzy rough set, the team extended the proposed distance measurements into fuzzy distance measurements and had some results in using fuzzy distance measurements to resolve the problem of attribute reduction on the decision table having numeric value region [3, 8, 18]. 10
Continue this research direction, with the objective of finding effective distance measures (with simple calculation formulas) to resolve the problem of attribute reduction, in this section we develop a new fuzzy distance measurement (hereinafter referred to as fuzzy distance) based on the partition distance measurement in works [48]. Using the fuzzy distance developed, we propose a filter-wrapped attribute reduction method in the decision table to improve the classification accuracy and minimize the number of attributes of reduction set. 2.3.1. Develop fuzzy distance between two fuzzy sets Proposition 2.1. Given two fuzzy sets A, B on object set U. then d A, B  A  B  A  B is a   fuzzy distance between A and B . 2.3.2. Develop fuzzy distance between two fuzzy partitions Proposition 2.2. Given decision table DS  U , C  D  with U  x1, x2 ,..., xn  and  R P ,  RQ     are fuzzy partitions derived by two fuzzy equivalent relations R P , RQ on P, Q  C . Then:      D  R P ,  RQ  1 n    xi P   xi Q   xi P   xi Q n 2 i 1      Is a fuzzy distance between  R P and  RQ , called as fuzzy partition distance. Proposition 2.3. Given decision table DS  U , C  D  with U  x1, x2 ,..., xn  and R is fuzzy equivalent relation defined on value region of conditional attribute set, then fuzzy distance between 2 attribute sets of C and C  D defined as below:     D  RC ,  RC D    1 n   xi C   xi C   xi D n 2 i 1  Proposition 2.5. Given   R   P is a fuzzy partition on , then we have: D   R  ,     D   R  ,     1 P P Proposition 2.6. Given decision table DS  U , C  D  with U  x1, x2 ,..., xn  , B  C and R is fuzzy equivalent relation defined on value region of conditional attribute set. Then          D  R B ,  R BD  D  R C ,  R C D  2.3.3. Attribute reduction using fuzzy distance based on approaching filter Definition 2.1. Given decision table DS  U , C  D  with B  C and R is fuzzy equivalent relation defined on value region of conditional attribute set. If       D   R  ,  R  1) D  R B ,  R BD C C D 2) b  B, D   R    ,   R     D   R  ,  R  B b B  b D C CD B is a reduction set of C , based on fuzzy distance. Definition 2.2. Given decision table DS  U , C  D  with B  C and b  C  B . The importance of attribute b to B defined by      SIGB  b   D  R B ,  R BD  D  R Bb ,  R BbD    Importance SIGB  b  denoted classification quality of attribute b to decisive attribute D and used as standard to select attribute of F_FDAR filter algorithm for finding reduction set F_FDAR algorithm (Filter - Fuzzy Distance based Attribute Reduction): filter algorithm for finding reduction set using fuzzy distance. 11
Input: given decision table DS  U , C  D  , fuzzy equivalent relation R defined on conditional attribute set. Output: A reduction set B 1. B   ; D  R B ,  R BD       1 ; 2. Calculate fuzzy partition distance D  RC ,  RC D ;      // Add gradually the greatest important attributes into B     3. While D  R B ,  R BD   D   R  ,  R  do C C D 4. Begin 5. With each a  C  B calculate     SIGB  a   D  R B ,  R BD   D   R Ba  ,  R BaD  6. Select am  C  B so that SIGB  am   Max SIGB  a  ; aC  B 7. B  B  am ; 8. End; //Remove redundant attributes in B if any 9. For each a  B 10. Begin 11.     Calculate D  R Ba ,  RBaD ;  12. If D   R    ,  R     D   R  ,  R  then B  B  a ; B a B  a D C C D 13. End; Return B ; The time complexity of F_FDAR algorithm is O C U  2 2  2.3.4. Attribute reduction using fuzzy distance based on approaching filter-wrapper Consider decision table DS  U , C  D  with C  a1, a2 ,..., am  and R is fuzzy equivalent relation defined on conditional attribute value. Set   D  RC ,  RC D . According to      F_FDAR algorithm, suppose that attributes ai , ai ,... added into empty set based on the greatest 1 2 important value of attribute until exist t 1,2,...m so that     D  Rai1 ,ai2 ,...,ait  ,  Rai1 ,ai2 ,...,ait D    . Finish the algorithm, we have reduction set   B  ai1 , ai2 ,..., ait , the classification accuracy on data set calculated by the classification accuracy on B. On the other hand, according to Clause 2.6 we have    i1 i1   i1 i2   i1 i2  D  Ra  ,  Ra D  D  Ra ,a  ,  Ra ,a D  ...  D  Ra ,...,a  ,  Ra ,...,a D    i1 it   i1 it  with threshold   given, set  Bk  ai1 ,..., aik  to satisfy    D  R Bk ,  R Bk D    and     D  R Bk aik 1  ,  R Bk aik 1 D   . then, Bk is called as approximate reduction set  . If Bk and  Bk  aik 1 ,..., ait  used to develop classifier, publication [91] shows that, the classification 12
accuracy on Bk  ai ,..., ai  is not better than on Bk . Suppose that Bk has better classification k 1 t accuracy Bk  ai ,..., ai  . Then, if select Bk is the result of an algorithm, Bk has higher k 1 t classification accuracy, the number of attributes is smaller, the generalization and performance of the classification algorithms are higher. This leads to a hybrid approach direction for finding approximate reduction set, is a combination of filter and wrapper. The filter method finds the approximate reduction set, wrapper method checks the classification accuracy of approximate reduction set for selecting the greatest accuracy reduction set. With this approach direction, the classification accuracy on the reduction set is higher compared to conventional filter methods. However, the implementation time will be greater because of the implementation of the classifier. The filter-wrapper algorithm finds a approximate reduct using fuzzy distance: FW_FDAR algorithm (Filter-Wrapper Fuzzy Distance based Attribute Reduction): The filter- wrapper algorithm for finding a approximate reduction set using fuzzy distance. Input: Decision table DS  U , C  D  , fuzzy equivalent relation R on conditional attribute value region. Output: Approximate reduction set Bx has the best classification accuracy. // Initialize 1.     B   ; D  R B ,  R BD   1 ; 2. Calculate the fuzzy distance D  RC ,  RC D ;      // filter phase, fins candidates for reduction set // Add gradually the greatest important attributes into B 3.     While D  R B ,  R BD   D   R  ,  R  do C C D 4. Begin 5. With each a  C  B calculate     SIGB  a   D  R B ,  R BD   D   R Ba  ,  R BaD  ; 6. Select am  C  B so that SIGB  am   Max SIGB  a  ; aC  B 7. B  B  am ; 8. End; // Wrapper phase, finds the highest classification accuracy of reduction set 9. Set t  B // t is number of elements of B, B includes attribute strings selected at  each iterative step of While loop, it means that B  ai ,ai , ai ,...,ai , ai ,..., ai  ; 1 1 2 1 2 t  10. Set B1  ai , B2  ai , ai ,..., Bt  ai , ai ,..., ai  1 1 2 1 2 t 11. For j = 1 to t 12. Begin 13. Calculate the classification accuracy on B j by a classifier and using 10-fold method; 14. End 15. Bx  B jo with B jo has greatest classification accuracy. 13
Return Bx ;   Time complexity of FW_FDAR algorithm is O C * U  O  C * T  with O T  is 2 2 complexity of classifier. 2.3.5. Experimental algorithms 1) Experimental objectives 1) FW_FDAR filter-wrap proposed algorithm compared to FPDAR filter algorithm in [18] in terms of implementation time and classification accuracy. 2) FW_FDAR filter-wrapper proposed algorithm compared to FEBAR filter-wrapper algorithm in [91] in terms of implementation time and classification accuracy. 2) Experimental data Table 2.8. Tested data set of FW_FDAR algorithm No. of conditional attributes No. of No. of No. Data set Description class objects All Nominal Real- attribute valued decided attribute 1 Lympho Lymphography 148 18 18 0 2 2 Wine Wine 178 13 0 13 3 3 Libra Libras 360 90 0 90 15 movement 4 WDBC Wisconsin 569 30 0 30 2 diagnostic breast cancer 5 Horse Horse colic 368 22 15 7 2 6 Heart Statlog (heart) 270 13 7 6 2 7 Credit Credit approval 690 15 9 6 2 8 German German credit 1000 20 13 7 2 data 3) Comparison result of classification accuracy Classification accuracy is shown by v   in which v mean accuracy value and  is standard error. Using CART classifier (regression tree) to calculate the classification accuracy in wrapper phase with 10-fold cross check method. Table 2.9. The classification accuracy of FW_FDAR, FEBAR, FPDAR Initial FW_FDAR FEBAR FPDAR accuracy algorithm algorithm algorithm No. Data set [91] [18] C Accuracy B Accuracy B Accuracy B Accuracy 1 Lympho 18 0.776± 4 0.768 ± 4 0.768 ± 6 0.722 ± 0.008 0.085 0.085 0.062 2 Wine 13 0.910 ± 5 0.893 ± 5 0.893 ± 7 0.886 ± 0.066 0.072 0.072 0.058 3 Libra 90 0.566 ± 7 0.658 ± 8 0.605 ± 26 0.556 ± 0.137 0.077 0.103 0.205 4 WDBC 30 0.924 ± 4 0.968 ± 3 0.952 ± 6 0.925 ± 14
0.037 0.058 0.027 0.644 5 Horse 22 0.829 ± 5 0.816 ± 4 0.802 ± 12 0.798 ± 0.085 0.052 0.066 0.058 6 Heart 13 0.744 ± 3 0.803 ± 3 0.803 ± 12 0.752 ± 0.072 0.074 0.074 0.055 7 Credit 15 0.826 ± 3 0.865 ± 2 0.846 ± 14 0.820 ± 0.052 0.028 0.048 0.078 8 German 20 0.692 ± 6 0.716 ± 5 0.702 ± 11 0.684 ± 0.030 0.029 0.043 0.024 The results in Table 2.9 show that the number of attributes of reduction set of the FW_FDAR proposed algorithm is much smaller than FPDAR filter algorithm. The accuracy of FW_FDAR is higher than FPDAR on all data sets. With FEBAR [91] filter-wrapper algorithm using fuzzy -entropy, the number of attributes of reduction set of FW_FDAR is approximate FEBAR, the classification accuracy of FW_FDAR is approximate FEBAR. 3) Comparison result of implementation time Table 2.10. Implementation time of FW_FDAR, FEBAR, FPDAR FW_FDAR algorithm FEBAR algorithm [91] FPDAR algorith No Data Filer Wrapper Total Filer Wrapper Total m [18] . set procedur procedur procedur procedur e e e e 1 Lymph 0.32 0.50 0.82 0.38 0.52 0.90 0.34 o 2 Wine 0.46 1.21 1.67 0.51 1.18 1.69 0.48 3 Libra 46.28 86.18 132,4 55.12 88.26 143.3 48.48 6 8 4 WDBC 20.15 8.74 28.89 26.38 8.22 34.60 22.32 5 Horse 4.85 2.68 7.53 5.26 2.65 7.91 4.98 6 Heart 1.22 1.52 2.74 1.45 1.78 3.23 1.26 7 Credit 16.58 3.42 20.00 19.26 3.98 23.24 18.02 8 German 52.48 8.64 61.12 71.22 8.28 79.50 54.65 Table 2.10 shows that the FW_FDAR algorithm has a significantly shorter implementation time than the FEBAR algorithm [91], especially in the filter procedure for finding reduction set. The reason is that FEBAR algorithm must calculate the fuzzy positive region to determine the coefficient , furthermore, FEBAR algorithm must calculate the complex logarithm formula in the Shannon entropy formula. However, algorithm based on approaching FW_FDAR filter-wrapper and FEBAR [91] have a significantly longer implementation time than algorithm based on approaching FPDAR filter[18] because they must implement the classifier to calculate the accuracy of approximate reduction set in wrapper phase. Chapter 3. INCREMENTAL METHOD OF ATTRIBUTE REDUCTION IN DECISION TABLE CHANGED TO USE FUZZY DISTANCE 3.1. Introduction With the continuous growth in data capacity, decision tables are becoming larger and changed continuously. The application of algorithms for finding reduction set based on 15
conventional approach has many challenges. Therefore, the researchers proposed an incremental calculated approach direction for finding reduction set to minimize the implementation time and be able to implement on large decision tables. In recent years, some research teams have proposed incremental algorithms for finding reduction set on decision tables changed based on approaching fuzzy rough set [15, 16, 97, 99]. Incremental algorithms for finding reduction set on decision tables based on approaching fuzzy rough set mentioned above have significantly smaller implementation time than non- incremental algorithms and can be implemented on large data tables. However, the algorithms mentioned above follow the direction of approaching conventional filter. Thus, the reduction set of algorithms mentioned above is not optimal both in terms of number of attributes and the classification accuracy. In this chapter, the thesis presents the formula for fuzzy distance incremental calculation (proposed in Item 2.3, Chapter 2) in the case of addition and removal of object set. According to the incremental formulas developed, the thesis presents an filter-wrapper incremental algorithm for finding reduction set in the case of addition or removal of object set. The results of this chapter are published in Work no. 7. 3.2. Filter-wrapper incremental algorithm for finding approximate reduction set when adding object set 3.2.1. Incremental formula for fuzzy distance calculation when adding object set Clause 3.1. Given decision table DS  U , C  D  with U  x1, x2 ,..., xn  and R is fuzzy equivalent relation defined on conditional attribute set value region. Suppose that object x added into U . Then, formula for fuzzy partition distance incremental calculation is:             n 21   x 2 DU x  RC ,  RC D  n    DU  RC ,  RC D  n 1 2 C   x C   x D  Clause 3.2. Given decision table DS  U , C  D  with U  x1, x2 ,..., xn  and R is fuzzy equivalent relation defined on conditional attribute set value region. Suppose that object set includes s elements U  xn1, xn2 ,..., xn s  added into U, with MU U ( RC )   pij  , MU U ( RD )  d ij  respectively is fuzzy equivalent relational  n  s  n  s   n  s  n  s  matrix and equivalent relational matrix on C and D then, formula for fuzzy partition distance incremental calculation:           x   2  n  2 s 2   DU U  RC ,  RCD   U D  R C ,  R C  D    xni C   xni D    n  s  i1 n i C ns s 1 with     pn i ,n  j 1  min  pn i ,n  j 1 , d n i ,n  j 1   j i 3.2.2. Filter-wrapper incremental algorithm for finding reduction set when adding object set Clause 3.3. Given decision table DS  U , C  D  with U  x1, x2 ,..., xn  and R is fuzzy equivalent relation defined on conditional attribute set value region, B  C is reduction set based on fuzzy partition distance. Suppose that object set includes s elements U  xn1, xn2 ,..., xn s  added into U . Then we have: 1) If D  xni   d with all i  1..s then 16
         2  n    s 2 2   n i C DU U  RC ,  RC D   U D  R C ,  R C  D  x   xn i C   xn i D 2) If ns  n  s  i1  xni B  xni D with all i  1..s then D   R  ,  R   D   R  ,  R  U U B B D U U C C D Based on Clause 3.3, filter-wrapper incremental algorithm for finding reduction set using fuzzy partition distance when adding object set U implemented as below: IFW_FDAR_AdObj algorithm (Incremental Filter-Wrapper Algorithm for Fuzzy Distance based Attribute Reduction When Add Objects). Input: 1) Decision table DS  U , C  D  with U  x1, x2 ,..., xn  , fuzzy equivalent relation R , reduction set B  C , approximate reduction set B0  C has the best classification accuracy 2) Fuzzy equivalent relational matrix MU ( R B )   pijB  , MU ( RC )   pijC  , MU ( R D )  dij  nn nn nn 3) Added object set U  xn1, xn2 ,..., xns  Output: Approximate reduction set Bbest of DS '  U  U , C  D  Step 1: Initialize 1. T :  // include candidates of the best reduction set 2. Calculate fuzzy equivalent relational matrix on object set U  U MU U ( R B )   pijB  , MU U ( R D )  dij   n  s  n  s   n  s  n  s  Step 2: Check add object set 3. Set X : U 4. For i  1 to s do 5. If  xni B  xni D then X : X  xni  6. If X   then Return B0 // unchanged approximate reduction set 7. Set U : X ; s : U ; //Object set reassigned Step 3: Implement algorithm for finding reduction set 8. Calculate the initial fuzzy partition distance     DU  R B ,  R BD ; DU  RC ,  RC D       9. Calculate the fuzzy partition distance by incremental formula;     DU U  R B ,  R BD ; DU U  RC ,  RC D ;       // filter phase, find candidates for reduction set 10. While DU U  R B ,  R BD       D U U   R  ,  R  do C C D 11. Begin 12. For each a  C  B do 13. Begin 14. Calculate DU U  R Ba ,  R BaD      by incremental formula; 15. Calculate SIGB  a   DU U   R  ,  R   D   R    ,  R    B BD U U B a B a  D 17
16. End; 17. Select a  C  B so that SIGB  am   Max SIGB  a  ; aC  B 18. B : B  am  ; 19. B0 : B0  am  20. T : T  B0 ; 21. End; // Wrapper phase, find reduction set having the highest classification accuracy 22. Set t : T //t is number of elements of T, T includes selected attribute string, it  means that T  B0  ai , B0  ai , ai ,..., B0  ai , ai ,..., ai  ; 1 1 2 1 2 t  23. Set T1 : B0  ai ; T2 : B0  ai , ai ;...; Tt : B0  ai , ai ,..., ai  1 1 2 1 2 t 24. For j = 1 to t 25. Begin 26. Calculate the classification accuracy on T j by a classifier using 10-fold method; 27. End 28. Bbest : T jo with T jo has the greatest classification accuracy. Return Bbest ; Time complexity of IFW_FDAR_AdObj algorithm is      max O B * U *  U  U  , O  C  B  * U *  U  U   O  C  B  * T 2  thus, IFW_FDAR_AdObj incremental algorithm minimizes significantly implementation time complexity, especially in the case U or C is large and B is small. 3.2.3. Experimental algorithm 1) Experimental objectives (1) Evaluate the efficiency of the implementation time of IFW_FDAR_AdObj filter- wrapper incremental algorithm with FW_FDAR and FEBAR [91] non-incremental algorithms. FEBAR is a filter-wrapper algorithm for finding reduction set using fuzzy -entropy in [91]. FW_FDAR is a filter-wrapper algorithm for finding reduction set using the fuzzy distance shown in Chapter 2. (2) Evaluate the efficiency of the classification accuracy of IFW_FDAR_AdObj filter- wrapper incremental algorithm with the IV-FS-FRS-2 [99] and IARM [98] filter incremental algorithms. IV-FS-FRS-2 and IARM are incremental algorithms for finding reduction set when adding object set using distinction relations in fuzzy rough set based on approaching filter. 2) Data set Table 3.1. Data set of IFW_FDAR_AdObj algorithm No. of conditional No. of No. of attributes No. of Data No. of No. Description initial incremental All Nominal Real- classes set objects objects objects attribute valued decided attribute (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) 1 Wine Wine 178 88 90 13 0 13 3 2 Libra Libras 360 180 180 90 0 90 15 18