YOMEDIA
ADSENSE
Privacy preserving multivariate classification based on tree vector product protocol
12
lượt xem 2
download
lượt xem 2
download
Download
Vui lòng tải xuống để xem tài liệu đầy đủ
In this paper, we firstly state the secure three-vector product computation problem and develop protocols called tree-vector products to solve this problem. Our protocols are developed based on some basic protocols and their security properties is validated base on the composition theorem for the semi-honest model [9]. We then use the proposed protocols to address a focus problem of privacy preserving multivariate classification.
AMBIENT/
Chủ đề:
Bình luận(0) Đăng nhập để gửi bình luận!
Nội dung Text: Privacy preserving multivariate classification based on tree vector product protocol
- JOURNAL OF SCIENCE OF HNUE FIT., 2011, Vol. 56, pp. 72-80 PRIVACY PRESERVING MULTIVARIATE CLASSIFICATION BASED ON TREE-VECTOR PRODUCT PROTOCOL Luong The Dung(∗) VietNam Information Security Commission Tran Duc Su Academy of Cryptographic Technique (∗) E-mail: thedungluong@gmail.com 1. Introduction Data mining has emerged as a significant technology for gaining knowledge from vast quantities of data [1]. Data mining technology allows us to analyze a personal data, or a organizational data. However, that creates threats to privacy, this reason might cause an obstruction to data mining collaboration projects. For example, two companies have a huge data set of records of their customers. These companies want to cooperatively conduct mining on their joint data set for their mutual benefit since this collaboration brings them results of mining which are more accurate than results of local data mining. However, each company does not want to, or does not have permission to disclose private information of other company’s customers. The challenge then is whether the two companies above obtain results of mining while still preserving their data secrecy. Recently, there has been growing focus on finding solutions to this problem [4, 5, 11]. In this paper, we firstly state the secure three-vector product computation problem and develop protocols called tree-vector products to solve this problem. Our protocols are developed based on some basic protocols [3, 4, 7, 8] and their security properties is validated base on the composition theorem for the semi-honest model [9]. We then use the proposed protocols to address a focus problem of privacy preserving multivariate classification. 2. Content 2.1. Background and problem statement 2.1.1. Problem statement Problem. Consider a data set D that has m observations, n attributes X1 , X2 , . . . , Xn and a class attribute Y . Where each Xi , (i = 1, . . . , n) takes values as real numbers, Y takes values as category data. The data set D is vertically 72
- Privacy preserving multivariate classification based on tree-vector product protocol partitioned over K parties where each party Pi (i = 1, 2, . . . , K) corresponds to a subset Di with some attributes of D. In other words, each Pi can only observe the values of some attributes of D for the same set of m observations, only one or the subset of the parties has attribute Y . We use the notation D = (D1 : D2 : . . . : DK : Y ) to present the union of these K subsets. The target of this work is to find solutions to build a Multivariate Classification model according to the above described circumstance, without disclosing the private data of any parties to the others. Previous works. Privacy preserving Multivariate Classification approach for the vertically partitioned model proposed in [2, 12]. This approach is based on secure matrix product computation techniques, it was showed efficient in cost communica- tion. However, this approach has a limitation that it requires all parties participating in the protocol knowing class attribute Y . We address the problem that only the subset of the parties knows the class attribute and the class attribute is sensitive and it needs be protected. 2.1.2. Cryptography tools Secure scalar product computation protocol. For example there are two parties Alice and Bob, Alice has a vector X = (x1 , .., xn ) and Bob has a vector Y = (y1 , . . . , yn ). The scalar product, or inner product, of two vectors Y and Y is defined as Pn hX, Y i = xi yi . i=1 The goal of secure scalar product protocol is to compute the scalar product of two vectors X and Y with no party disclose its private data (vector) to other parties. Secure scalar product protocols are an important primitive for privacy- preserving data mining. Several such protocols have been proposed (e.g. [3], [4], [5], etc.) with varying degrees of security and cost communication. A nice overview of the problem and the security properties of some solutions appears in [6]. Oblivious polynomial evaluation. The problem of “oblivious polynomial evalu- ation” was first considered in [7]. As with oblivious transfer, this problem involves a sender and a receiver. The sender’s input is a polynomial Q of degree k over some finite field F and the receiver’s input is an element z ∈ F (the degree k of Q is public). The protocol is such that the receiver obtains Q(z) without learning anything else about the polynomial Q, and the sender learns nothing. An efficient solution to this problem was presented in [7]. The overhead of that protocol is O(k) exponentiations (using the method suggested in [8] for doing a 1-out-of-N oblivious transfer with O(1) exponentiations). The semi-honest model. In the semi-honest model, each party participating in the protocol have to follow the rules using its correct input, and it can’t use what it sees during execution of the protocol to compromise security. This model is 73
- Luong The Dung and Tran Duc Su reasonable for many real situations because parties who want to mine data for their mutual benefit will follow the protocol to get correct results. The definition of secure two-party computation in the semi-honest model is stated in [9]. Basically the definition states that a computation is secure if the view of each party during the execution of the protocol can be effectively simulated by the input and the output of the party. This is not quite the same as saying that private information is protected because it is possible to have privacy breaches by inferring from the final result and local information. In this paper, we also use the Composition Theorem for the semi-honest model. The detail of its discussion and the proof can be seen in [9]. Theorem 2.1. (The Composition Theorem) Suppose that g is privately reducible to f and that there exists a protocol for privately computing f . Then there exists a protocol for privately computing g. 2.2. Tree-vector Product For next works, in this section we state a secure three-vector product compu- tation problem and propose protocols to address this problem. Support that three parties Pa , Pb and Pc have three vectors X = (x1 , . . . , xn ), Y = (y1 , . . . , yn ) and Z = (z1 , . . . , zn ) respectively. We call the three-vectors scalar product of X, Y and Pn Z as hX, Y, Zi = x − iyi zi . Our goal is to design a protocol for secure three- i=1 vectors product computation with no party disclose its private data to other party, all parties are not able to learn anything other than the final result. 2.2.1. Protocol 1 To address the above problem, we built three-vector product protocols from the two basic protocols: secure scalar product and secure polynomial comes evalu- ation. In actual implementation, these basic protocols have proved be secure and efficient, hence some authors have used these protocols as an important prim- itive to design their protocols. Let R = (r1 , . . . , rm ) be a random vector. Let (Q1 (z), Q2 (z), . . . , Qm (z)) = (x1 z + x2 z + . . . + xm z + rm ) be n degree 1 poly- nomials, we call it be a linear polynomial vector. we design the secure three-vector product protocol as below: Input. Three parties Pa , Pb , Pc have three vectors X, Y, Z respectively. Output. hX, Y, Zi. 1. Pa generates a random vector and then defines a linear polynomial vector where be a linear polynomial. Each element of vector R follows a uniform distribution over a user-defined field. 2. Pa and Pb engage in a private evaluation of each Qi (yi ), (i = 1, . . . , m)), in which Pb obtains Q = (Q1 (y1 ), Q2 (y2 ), , . . . , Qm (ym )) where Qi (yi ) = xi yi + ri 74
- Privacy preserving multivariate classification based on tree-vector product protocol 3. Pa and Pc execute the secure product protocol with Pa inputting R and in- putting Z, in which Pc obtains 4. Pb and Pc execute the secure product protocol with Pb inputting and Pc in- putting Z, in which Pb obtains 5. Pb computes hX, Y, Zi = hQ, Zi − hR, Zi. 6. Pb sends to Pa and Pc Theorem 2.2. Protocol 1 privately computes the shares of the three-vectors Product value Proof. In this section, we show that Protocol 1 allows us to obtain the three-vectors Scalar Product values and privacy-preserving. We have Q = (Q1 (y1 ), Q2 (y2 ), , . . . , Qm (ym )) = (x1 y1 + r1 , . . . , xm ym + rm ), Pm P so hQ, Zi = (xi yi zi + ri zi ) and hR, Zi = 6mi=1 ri zi , i=1 P m therefor, hX, Y, Zi = hQ, Zi − hR, Zi = xi yi zi , this is the three-vectors i=1 product value. In order to show that a protocol is secure in the semi-honest model, we simply need to apply the composition theorem, so we need to show that each step in Protocol 1 is secure. We note that Protocol 1 requires the secure scalar product and secure poly- nomial evaluation operation, where there already exists many protocols that are correct and secure ([3,4,5]). During these execution of this protocols, the parties participating in the protocols are not able to learn anything other than the final result. In the protocol 1, the only communication taking place is at steps 2, 3 and 4. In Step 2, parties Pa and Pb engage in a private evaluation of each Qi (yi ), (i = 1, . . . , m), neither party share their private, only Pb obtains the vector Q that each Qi (yi ) = xi yi + ri is a random number from a uniform distribution over a defined field. In step 3, Pa and Pc execute the secure product protocol, each party only ob- tains hR, Zi without any other useful information. The results of the scalar product protocol are randomly shared, which can be simulated by both Pa and Pc as shown in (Lemma 4.1 [11]) In step 4, similarly, Pb and Pc only obtain the value hQ, Zi without knowing other useful information from the other party. Protocol 1 can also be proved by using a simulation method [10]. The basic idea is to show the view of each party during the execution of the protocol that can 75
- Luong The Dung and Tran Duc Su be effectively simulated by giving the input and output of that party. Here, we show that both parties are not able to learn anything but the final result. Remark. We should note that using the same random vector R in many iter- ations of Protocol 1 will cause privacy breaches. For example, assume that Pa has a vector X, Pb has two vectors Y and Y ′ , and Pc has a vector Z. The parties need to execute two iterations of Protocol 1 to compute hX, Y ′ , Zi and hX, Y, Zi. If we use the same vector R in two above iterations then each xi of Pa , Pb can establish an equation system including two equations: xi yi + ri = qi and xi yi′ + ri = qi′ (qi and qi′ were determined), so Pb can get xi . Thus, to prevent privacy breaches we need to generate a different vector R in each iteration of Protocol 1. Cost Communication. Let π(m) be an expression for the communication cost of some secure scalar product protocols where is the number of elements of the vectors participating in the protocol. Let π ′ (m) be an expression for the communication cost of the private polynomial evaluation protocol in a private evaluation of the linear polynomial vector. In Protocol 1, the communication primarily occurs at steps 2, 3 and 4, so the communication cost of Protocol 1 is 2π(m) + π ′ (m) 2.2.2. Protocol 2 The protocol 1 is secure and it also protects each party’s private information against collusion of other dishonest parties. However, in some determined conditions, we can support that there are no collusions among parties. Thus we propose a secure three- vectors scalar product protocol that can have lower communication cost other than that of Protocol 1. The idea is that, because Pa shares the vector R with Pc , we can ignore the step of computing hR, Zi to reduce communication costs. However, the drawback of this protocol is that it does not protect private data of Pa against collusion between Pb and Pc . Input. Three parties Pa , Pb and Pc have three vectors X, Y, Z respectively. Output. hX, Y, Zi. 1. Pa and Pc jointly generate a random vector R = (r1 , . . . , rm ). Each element of vector R follows a uniform distribution over a user-defined field. 2. Pa generates a random vector a linear polynomial vector Q = (Q1 (z1 ), Q2 (z2 ), , . . . , Qm (zm )) where Qi (z) = xi z + ri be a linear polyno- mial. 3. Pa and Pb engage in a private evaluation of each Qi (z), in which Pb obtains Q = (Q1 (y1 ), Q2 (y2 ), , . . . , Qm (ym )) where Qi (z) = xi yi + ri . 4. Pb and Pc execute the secure scalar product protocol with Pb inputting Q and Pc inputting Z, in which they obtain hQ, Zi. 5. Pc computes hR, Zi and then computes hX, Y, Zi = hQ, Zi − hR, Zi. 76
- Privacy preserving multivariate classification based on tree-vector product protocol 6. Pc sends hX, Y, Zi to Pa and Pb Theorem 2.3. Protocol 2 privately computes the shares of the three-vector product value assuming non-collusion. Proof. Similar to the proof of Protocol 1, we can also show that Protocol 2 is privacy-preserving if there is no collusion. Because in Protocol 2, the only com- munication taking place is at steps 3, 4. In step 3, the secure polynomial evaluation protocol executed that only Pb obtains a value random from the defined field. The communication occurs at step 4 with the invocation of the scalar product protocol. The results of the scalar product protocol are random shares, which can be simu- lated as shown in (Lemma 4.1 [11]). Thus, Protocol 1 can be simulated, with the composition theorem being applied to the scalar product protocol and the polyno- mial evaluation protocol. However, Protocol 2 can cause privacy breaches when Pb enters into collusion with Pc , then Pc can share the vector R with Pb , so Pb can get vector X of Pa . Communication cost. In Protocol 2, the communication primarily occurs at steps 3 and 4, so the communication cost of Protocol 1 is π(m) + π ′ (m). Thus, the communication cost of Protocol 2 is lower than that of Protocol 1. The reason for that is because of Pa sharing R with Pc , Pa and Pc do not have to use a scalar product protocol for computation hR, Zi. 2.2.3. Protocol 3 In this section, we consider a case that three vectors X, Y, Z belong to only two parties. For example, X and Y belong to Pa , and Z belongs to Pb . To compute hX, Y, Zi, Pa generates R = (x1 y1 , . . . , xm ym ), then Pa and Pb execute the secure product protocol with Pa inputting R and Pb inputting Z, in which they obtain hR, Zi. It should be noted that hX, Y, Zi = hR, Zi. Security properties of this protocol derive from the secure scalar product protocol. 2.3. Privacy Preserving Multivariate Classification 2.3.1. Multivariate Classification In this section, we introduce the Multivariate Classification method and ana- lyze this method to find the target formula that need be securely computed. Classification is one of the most important technologies of data mining that has wide applications in the various areas. The goal of classification is to build a model which can predict the value of one variable, based on the known values of the other variables. There are various classification methods, such as Statistical methods, Support vector machine, and Decision tree [1]. In this paper we consider a method based on a statistical analysis presented in [2]. An observation is said belonging a class k if the attributes vector of that observation is the closest to the centroid of class k. The Mahalanobis distance of an 77
- Luong The Dung and Tran Duc Su observation in a class indicates how far the observation is from the centroid of all observations in that class. Let Sk represents the data set consisting of the vectors of all the observations in class k. Let S k be the vector of means of observations in class k. Let Sk (j) represents the j row of matrix S k , and Sbk be a matrix where Sbk (j) = Sk (j) − S k . The th Mahalanobis distance dk between an observation V (its attributes vector is V ) and the centroid of class k can be computed as the following: d2k = (V − S k )T Ck−1 (V − S k ) + ln |Ck | 1 where Ck is the covariance matrix in class k: Ck = hSˆk , Sˆk i m−1 Assume class attribute Y takes values in a discrete domain including *********** values. After computing d1 , . . . , dL , we can conclude that the observation V belongs to class i if di = min{d1 , d2 , ..., dL }. Therefore, in order to determine a class for a new observation, we need to compute Ck and S k . 2.3.2. Privacy Preserving Multivariate Classification Now we address the privacy preserving classification problem, assume the data set D is vertically partitioned on K parties. We need to design a protocol that allows each party to predict a class label for a new observation based on the joint data set of parties with no party disclosing its private data to other parties. To predict class for a new observation, the parties need to know d1 , . . . , dL . So they need to securely compute the covariance matrix Ck and the mean vector S k for each class k. In this section, we analyze steps to compute Ck and S k . So finally, we propose a protocol through analysis steps. Xi = (x1 , x2 , ..., xm ) be the column vector ith of the data set D (i = 1, . . . , n). 1 We denote Yk = (y1 , y2, ..., ym ) be a vector that yi = if the row ith of the data mk set belongs to the class k else yi = 0, where mk be the number of rows of the data k set belong to class k. Let X i be the mean of the vector Xi in the class k, then: k P Xi = m j=1 xj yj = hXi , Yk i k k k It should be noted that S k = (X 1 , X 2 , .., X n ). Therefor, to compute S k , we have to address the computation of a scalar product of two vector Xi and Yk . if Xi and Yk belong to the same party, then this work is locally completed without disclosing private data of any parties. If Xi and Yk belong to two different parties, then we use the secure scalar product protocol. To continue, we have to address the computation of Ck . Let X ˆ i = (ˆ x1 , x ˆ2 , ..., xˆm ) k be a vector that each element xˆi = xi − X i . we denote Yk′ = (y1′ , y2′ , ..., ym ′ ) be a 1 vector that each yi′ = if the row i of the data set belongs to class k else th mk − 1 yi′ = 0. Each the element (I, j)th of Ck is determined by 78
- Privacy preserving multivariate classification based on tree-vector product protocol m X (i) (j) ˆi, X ˆ j , Yk ′ i Ck (i, j) = xˆt xˆt yk′ = hX t=1 We assume without loss of generality that only one party Pc (c ∈ {1, 2, . . . , K}) has the class attribute Y . We can divide this into three possibilities: - Three vectors Xi , Xj and Yk belong to the same party, then C(I, j) is locally computed without disclosing any private information. - Three vectors Xi , Xj and Yk belong to two parties. We can use Protocol 3 to address this case. - Xi , Xj and Yk belong to three different party. We address this problem by Protocol 1 or Protocol 2. The above analysis has showed that we can securely compute the mean vector and the covariance matrix by proposed protocols in section 3. Based on the compo- sition theorem and derivation of proposed protocols, our computations can privately make the shares of the mean vector and the covariance matrix for each class k. 3. Conclusion The major contributions of this paper are a privacy preserving multivariate classification for vertically partitioned data, in which only subsets of parties know class attribute. To address the above problem, three- vector product protocol was proposed in this paper. There is another direction for our future works that devel- oping multiple-vectors products protocol is a non-trivial extension, especially if we consider collusion between parties as well. This goal is to develop other efficient pri- vacy preserving data mining methods such as Association Rules, Network Bayesian, etc. REFERENCES [1] J. Han and M. Kamber. Data Mining Concepts and Techniques. Morgan Kauf- mann Publishers, 2001. [2] W.L. Du, Yunghsiang S. Han and S. Chen. Privacy-preserving multivariate sta- tistical analysis: Linear regression and classification. Proceedings of the Fourth SIAM [3] International Conference on Data Mining, pages 222-233, 2004. W. Du and M. Atallah. Privacy-preserving cooperative statistical analysis. In Proc. of the 17th Annual Computer Security Applications Confer- ence, pp. 103-110, 200 [4] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proc. of the Eighth ACM SIGKDD Inter- national Conference on Knowledge Discovery and Data Mining, pages 639-644. ACM Press, 2002. [5] R. N. Wright and Z. Yang. Privacy-preserving Bayesian network structure com- putation on distributed heterogeneous data. In Proc. of the Tenth ACM SIGKDD 79
- Luong The Dung and Tran Duc Su International Conference on Knowledge Discovery and Data Mining, pages 713- 718. ACM Press, 2004. [6] Bart Goethals, Sven Laur, Helger Lipmaa, and Taneli Mielikainen. On private scalar product computation for privacy-preserving data mining. In Proc. of the Seventh Annual International Conference in Information Security and Cryptol- ogy, LNCS. Springer-Verlag, 2004. to appear. [7] M. Naor and B. Pinkas, Oblivious Transfer and Polynomial Evaluation, Proceed- ings of the 31th Annual Symposium on the Theory of Computing (STOC), ACM, 1999, pp. 245-254. [8] M. Naor and B. Pinkas, Efficient Oblivious Transfer Protocols, Proceedings of 12th SIAM Symposium on Discrete Algorithms (SODA), January 7-9 2001, Wash- ington DC, pp. 448-457. [9] O. Goldreich. Secure multi-party computation, Sept. 1998. (working draft). [10] O. Goldreich. The Foundations of Cryptography. Cambridge University Press, 2004. [11] Jaideep Vaidya, Chris Clifton. Privacy Preserving Naive Bayes Classifier for Vertically Partitioned Data. Proceedings of the Fourth SIAM International Con- ference on Data Mining/ Michael W. Berry, Chandrika Kamath, Umeshwar Dayal, David Skillicorn, 2004 [12] Luong, T.D., Ho, T.B. (2008). Privacy Preserving for Multivariate Outlier De- tection, Third International Conference on Knowledge, Information and Creativ- ity Support Systems (KICSS 2008), December 22-23, Hanoi, 7-16. [13] Agrawal, R. and Srikant, R. 2000. Privacy-preserving data mining. SIGMOD Rec. 29, 2 (Jun. 2000), pp. 439-450. ABSTRACT Privacy preserving multivariate classification based on tree-vector product protocol Data mining has emerged as a significant technology for gaining knowledge from vast quantities of data. However, many databases can contain privacy infor- mation. Thus, there are several data mining situations where privacy and security issues arise. The challenge then is how we can obtain results of mining while still preserving the secrecy of data. Recently, there has been growing focus on finding solutions to this problem. In this paper, we state a problem of secure tree-vector product computation and propose protocols to solve this problem. Afterward we use the proposed protocols to address a focus problem of privacy preserving multivariate classification. 80
ADSENSE
CÓ THỂ BẠN MUỐN DOWNLOAD
Thêm tài liệu vào bộ sưu tập có sẵn:
Báo xấu
LAVA
AANETWORK
TRỢ GIÚP
HỖ TRỢ KHÁCH HÀNG
Chịu trách nhiệm nội dung:
Nguyễn Công Hà - Giám đốc Công ty TNHH TÀI LIỆU TRỰC TUYẾN VI NA
LIÊN HỆ
Địa chỉ: P402, 54A Nơ Trang Long, Phường 14, Q.Bình Thạnh, TP.HCM
Hotline: 093 303 0098
Email: support@tailieu.vn