SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trinh Minh Duc*
College of Information and Communication Technology – TNU
SUMMARY
In this paper, we present a spam email filtering method based on machine learning, namely the Naïve Bayes classification method, because this approach is highly effective. Thanks to its learning ability (self-improving performance), a system applying this method can automatically learn and improve the effectiveness of spam email classification. At the same time, the system's classification ability is continuously updated by new incoming emails; it is therefore much harder for spammers to defeat the classifier than with traditional solutions.
Key words: Machine learning, email spam filtering, Naïve Bayes.
INTRODUCTION*
Email classification is in essence a two-class text classification problem: the initial dataset consists of spam and non-spam emails, and the texts to be classified are the emails arriving in the inbox. The output of the classification process is the class label assigned to an email, that is, one of the two classes: spam or non-spam.
The general model of the spam email classification problem is described in Figure 1. The categorization process can be divided into two phases, illustrated by the sketch below:
The training phase: the input of this phase is the set of spam and non-spam emails; the output is the trained data, produced by applying a suitable classification method, which serves the classification phase.
The classification phase: the input of this phase is an email together with the trained data; the output is the classification result for the email: spam or non-spam.
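To make the two phases concrete, here is a minimal sketch in Python; it is our own illustration, not the paper's system, and the function names and the word-set feature representation are assumptions:

# Minimal two-phase pipeline sketch. extract_features, train_filter and
# classify_email are hypothetical names used only for illustration.

def extract_features(email_text):
    # Represent an email by the set of lower-cased words it contains.
    # A real filter may also use the title, HTML markers, attachments, etc.
    return set(email_text.lower().split())

def train_filter(spam_emails, non_spam_emails):
    # Training phase: input is the set of spam and non-spam emails;
    # output is the trained data used later by the classification phase.
    return {
        "spam": [extract_features(e) for e in spam_emails],
        "non-spam": [extract_features(e) for e in non_spam_emails],
    }

def classify_email(trained_data, email_text):
    # Classification phase: input is an email plus the trained data;
    # output is 'spam' or 'non-spam'. The concrete decision rule is
    # supplied by a classification method, e.g. Naive Bayes below.
    raise NotImplementedError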
The rest of this paper is organized as follows. In Sect. 2, we formulate the Naïve Bayes classification method and our solution. In Sect. 3, we show experimental results that evaluate the efficiency of this method. Finally, in Sect. 4, we conclude and point out possible future directions.
Figure 1. The spam email classification model

* Tel: 0984215060; Email: duchoak15@gmail.com

NAÏVE BAYES CLASSIFICATION METHOD [4]
Naïve Bayes is a supervised learning classification method based on a probability model: the classification process relies on the probability values of the likelihood of the hypotheses. The Naïve Bayes classification technique rests on Bayes' theorem and is particularly suitable for problems with high-dimensional input. Although Naïve Bayes is
quite simple, its classification capability is often much better than that of more complex methods. By relaxing the assumption of statistical dependence between attributes, the Naïve Bayes method treats the attributes as conditionally independent of each other.
Classification problem

Given: a training set D, in which each training instance x is represented as an n-dimensional attribute vector x = (x_1, x_2, ..., x_n), and a pre-defined set of classes C = {c_1, c_2, ..., c_m}. For a new instance z, which class should z be classified into?

The probability P(c_k | z) that a new instance z belongs to class c_k leads to the classification rule

c = \arg\max_{c_i \in C} P(c_i \mid z)

c = \arg\max_{c_i \in C} P(c_i \mid z_1, z_2, \dots, z_n)

c = \arg\max_{c_i \in C} \frac{P(z_1, z_2, \dots, z_n \mid c_i)\, P(c_i)}{P(z_1, z_2, \dots, z_n)}

c = \arg\max_{c_i \in C} P(z_1, z_2, \dots, z_n \mid c_i)\, P(c_i)

(the denominator P(z_1, z_2, ..., z_n) is the same for all classes).

Assumption of the Naïve Bayes method: the attributes are conditionally independent given the classification:

P(z_1, z_2, \dots, z_n \mid c_i) = \prod_{j=1}^{n} P(z_j \mid c_i)

The Naïve Bayes classifier therefore finds the most probable class for z:

c_{NB} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(z_j \mid c_i)

The Naïve Bayes classification algorithm can be described succinctly as follows:

The learning phase (given a training set). For each class label c_i ∈ C, estimate the prior probability P(c_i). For each attribute value x_j, estimate the probability of that attribute value given the class c_i: P(x_j | c_i).

The classification phase (given a new instance). For each class c_i ∈ C, compute

P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)

and select the most probable classification

c^{*} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)

There are two issues we need to solve:

1. What happens if no training instance associated with class c_i has attribute value x_j? Then P(x_j | c_i) = 0, and hence

P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i) = 0

Solution: use a Bayesian approach (the m-estimate) to estimate P(x_j | c_i):

P(x_j \mid c_i) = \frac{n(c_i, x_j) + m p}{n(c_i) + m}

where n(c_i) is the number of training instances associated with class c_i; n(c_i, x_j) is the number of training instances associated with class c_i that have attribute value x_j; p is a prior estimate for P(x_j | c_i) (assuming uniform priors, p = 1/k if the attribute X has k possible values); and m is a weight given to the prior, i.e., we augment the n(c_i) actual observations by an additional m virtual samples distributed according to p.
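Before turning to the second issue, here is a minimal sketch of the learning phase in Python, including the m-estimate above. The names, and the choice of binary word-presence attributes (so k = 2 and p = 1/2 as defaults), are our assumptions, not the paper's implementation:

from collections import Counter

def train_naive_bayes(emails, labels, m=2.0, p=0.5):
    # emails: list of attribute-value sets (e.g. the words present in each email)
    # labels: parallel list of class labels, 'spam' or 'non-spam'
    # m, p:   weight and prior of the m-estimate correction
    classes = set(labels)
    n_ci = Counter(labels)                     # n(ci): training instances per class
    n_ci_xj = {c: Counter() for c in classes}  # n(ci, xj): per-class attribute counts
    for features, label in zip(emails, labels):
        n_ci_xj[label].update(features)

    prior = {c: n_ci[c] / len(labels) for c in classes}  # P(ci)

    def likelihood(xj, ci):
        # P(xj | ci) = (n(ci, xj) + m*p) / (n(ci) + m); never zero thanks to m*p
        return (n_ci_xj[ci][xj] + m * p) / (n_ci[ci] + m)

    return prior, likelihood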
2. The limit of precision of computer arithmetic. P(x_j | c_i) < 1 for every attribute value x_j and class c_i, so when the number of attribute values is very large, the product of the probabilities underflows.

Solution: use a logarithmic function of the probability:

c^{*} = \arg\max_{c_i \in C} \left( \log P(c_i) + \sum_{j=1}^{n} \log P(x_j \mid c_i) \right)

In the spam email classification problem, each sample we consider is an email, and the set of classes an email can belong to is C = {spam, non-spam}. When we receive an email about which we know nothing, it is very hard to decide whether it is spam or non-spam; the more characteristics or attributes of the email we know, the more we can improve the efficiency of the classification. An email has many characteristics, such as its title, its content, and whether or not it has an attachment. A simple example: if we know that 95% of HTML emails are spam and we receive an HTML email, we can use this prior probability to compute the probability that the received email is spam; if that probability is greater than the probability that it is non-spam, we can conclude that the email is spam, although such a conclusion is not very accurate. However, the more information we have, the greater the probability of a correct classification. To obtain the prior probabilities, we use the Naïve Bayes method to train on the initial set of template emails, and then use these probabilities to classify a new email. The probability calculation is based on the Naïve Bayes formula. We then compare the obtained probability values with each other: if the spam probability is greater than the non-spam probability, we conclude that the email is spam; otherwise it is non-spam. [5]
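Putting the pieces together, a matching sketch of the classification phase computes the arg max in log space, avoiding the underflow of issue 2 above; it continues the hypothetical train_naive_bayes sketch:

import math

def classify(email_features, prior, likelihood):
    # Return arg max over ci of log P(ci) + sum_j log P(xj | ci),
    # which selects the same class as the product form above.
    best_class, best_score = None, float("-inf")
    for ci in prior:
        score = math.log(prior[ci])
        for xj in email_features:
            score += math.log(likelihood(xj, ci))  # safe: m-estimate keeps P > 0
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

For instance, prior, likelihood = train_naive_bayes(train_emails, train_labels) followed by classify(extract_features(new_email), prior, likelihood) would label a new email as 'spam' or 'non-spam', depending on the trained counts.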
EXPERIMENTAL RESULTS

We have implemented a test applying the Naïve Bayes method to email classification. The total number of emails in the sample dataset is 4601, including 1813 spam emails (39.4%). This dataset, called Spambase, can be downloaded at http://archive.ics.uci.edu/ml/datasets/Spambase.

The dataset is divided into two disjoint subsets: the training set D_train (66.7% of the data), used for training the system, and the test set D_test (33.3%), used for evaluating the trained system.

In order to evaluate a machine learning system's performance, we often use measures such as Precision (P), Recall (R), Accuracy rate (Acc), Error rate (Err), and the F1 measure, computed as follows:

P = \frac{n_{S \to S}}{n_{S \to S} + n_{N \to S}}

R = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to N}}

Acc = \frac{n_{N \to N} + n_{S \to S}}{N_N + N_S}

Err = \frac{n_{N \to S} + n_{S \to N}}{N_N + N_S}

F_1 = \frac{2 P R}{P + R} = \frac{2}{\frac{1}{P} + \frac{1}{R}}

where:
n_{S→S} is the number of spam emails which the filter recognizes as spam;
n_{S→N} is the number of spam emails which the filter recognizes as non-spam;
n_{N→S} is the number of non-spam emails which the filter recognizes as spam;
n_{N→N} is the number of non-spam emails which the filter recognizes as non-spam;
N_N is the total number of non-spam emails;
N_S is the total number of spam emails.

Experimental results on the Spambase dataset

We present test results for two ways of dividing the Spambase dataset:

Experiment 1: divide the original Spambase dataset with proportion k_1 = 2/3, that is, 2/3 of the dataset for training and the remainder for testing.

Experiment 2: divide the original Spambase dataset with proportion k_2 = 9/10, that is, 9/10 of the dataset for training and the remainder for testing.

Table 1. Testing results

                Experiment 1    Experiment 2
n_{S→S}         486             180
n_{S→N}         119             2
n_{N→N}         726             276
n_{N→S}         204             3
Recall          80.33%          98.90%
Precision       70.43%          98.36%
Acc             78.96%          98.92%
Err             21.04%          1.08%
F1 measure      75.05%          98.63%

The testing results in Experiment 2 show very high accuracy (approximately 99%).
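As a quick check of the formulas, the following Python sketch recomputes the measures from the Experiment 1 confusion counts of Table 1 (the function name is ours):

def evaluate(n_ss, n_sn, n_nn, n_ns):
    # n_ss, n_sn: spam emails classified as spam / as non-spam
    # n_nn, n_ns: non-spam emails classified as non-spam / as spam
    total = n_ss + n_sn + n_nn + n_ns          # N_S + N_N
    precision = n_ss / (n_ss + n_ns)
    recall = n_ss / (n_ss + n_sn)
    acc = (n_nn + n_ss) / total
    err = (n_ns + n_sn) / total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, acc, err, f1

# Experiment 1 column of Table 1:
print(evaluate(486, 119, 726, 204))
# approx. P = 70.43%, R = 80.33%, Acc = 78.96%, Err = 21.04%, F1 = 75.1%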
CONCLUSION

In this paper, we have examined the effect of a classifier that has a self-learning ability to improve classification performance: the Naïve Bayes classification method.
The Naïve Bayes classifier proved suitable for the email classification problem. Currently, we are continuing to build standard training samples and to adjust the Naïve Bayes algorithm to improve the accuracy of classification. In the near future, we will build a standard training and testing data system for both English and Vietnamese. This is a large problem and needs further concentrated effort.

REFERENCES
[1]. Jonathan A. Zdziarski, Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification, No Starch Press, 2005.
[2]. Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz, A Bayesian Approach to Filtering Junk E-Mail, AAAI Workshop on Learning for Text Categorization, 1998.
[3]. Sun Microsystems, JavaMail API Design Specification, Version 1.4.
[4]. T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[5]. Lê Nguyễn Bá Duy, A Study of Email Classification Approaches and Development of a Mail Client Supporting Vietnamese (in Vietnamese), University of Science, Ho Chi Minh City, 2005.
ABSTRACT
SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trịnh Minh Đức*
College of Information and Communication Technology – TNU

In this paper, we introduce a spam email filtering method based on machine learning, since this approach is highly effective. With its learning ability (self-improving performance), the system can automatically learn and improve the effectiveness of spam email classification. At the same time, the classification ability of the system is continuously updated with new spam samples; it is therefore very difficult for spammers to defeat it, compared with other traditional approaches.
Keywords: Machine learning, spam filtering, Naïve Bayes.

Received: 13/3/2014; Reviewed: 15/3/2014; Accepted for publication: 25/3/2014
Scientific reviewer: Dr. Trương Hà Hải – College of Information and Communication Technology – Thai Nguyen University