intTypePromotion=1
zunia.vn Tuyển sinh 2024 dành cho Gen-Z zunia.vn zunia.vn
ADSENSE

Spam email filtering based on machine learning

Chia sẻ: Thi Thi | Ngày: | Loại File: PDF | Số trang:5

24
lượt xem
3
download
 
  Download Vui lòng tải xuống để xem tài liệu đầy đủ

In the paper, we are going to present a spam email filtering method based on machine learning, namely Naïve Bayes classification method because this approach is highly effective. With the learning ability (self improving performance), a system applied this method can automatically learn and ameliorate the effect of spam email classification. Simultaneously, the ability of system’s classification is also updated by new incoming emails, therefore, it is very difficult for spammers to overcome the classifier, compared to traditional solutions.

Chủ đề:
Lưu

Nội dung Text: Spam email filtering based on machine learning

Trịnh Minh Đức<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> 118(04): 133 - 137<br /> <br /> 7<br /> <br /> SPAM EMAIL FILTERING BASED ON MACHINE LEARNING<br /> Trinh Minh Duc*<br /> College of Information and Comunication Technology – TNU<br /> <br /> SUMMARY<br /> In the paper, we are going to present a spam email filtering method based on machine learning,<br /> namely Naïve Bayes classification method because this approach is highly effective. With the<br /> learning ability (self improving performance), a system applied this method can automatically<br /> learn and ameliorate the effect of spam email classification. Simultaneously, the ability of system’s<br /> classification is also updated by new incoming emails, therefore, it is very difficult for spammers<br /> to overcome the classifier, compared to traditional solutions.<br /> Key words: Machine learning, email spam filtering, Naïve Bayes.<br /> <br /> INTRODUCTION*<br /> The Email classification is actually the twoclass text classification problem, that is: the<br /> early dataset consists of spam and non-spam<br /> emails, the texts to be classified as the emails<br /> are sent to inbox. The output of the<br /> classification process is to determine the class<br /> label for an email – belonging to either one of<br /> the two classes: spam or non-spam<br /> .<br /> The general model of the spam email<br /> classification problem can be discribed as<br /> follows:<br /> <br /> The categorization process can be divided<br /> two phases:<br /> <br /> The training phase: The input of this phase is<br /> the set of spam and non-spam emails. The<br /> output is the trained data applied a suitable<br /> classification method to serve for the<br /> classification period.<br /> The classification phase: The input of this<br /> phase is an email, together with the trained<br /> data. The output is the classification result of<br /> the email: spam or non-spam.<br /> The rest of this paper is organized as follows.<br /> In Sect. 2, we formulate Naïve Bayes<br /> classification method and our solution. In<br /> Sect. 3, we show experimental results to<br /> evaluate the efficiency of this method.<br /> Finally, in Sect. 4, we conclude by showing<br /> possible future directions.<br /> NAïVE<br /> BAYES<br /> METHOD [4]<br /> <br /> Figure 1. The spam email classification model<br /> *<br /> <br /> Tel: 0984215060; Email: duchoak15@gmail.com<br /> <br /> CLASSIFICATION<br /> <br /> Naïve Bayes method is a supervised learning<br /> classification method, and based on<br /> probability, based on a probability model<br /> (function). The classification process is based<br /> on the probability values of the likelihood of<br /> the hypotheses. Classification technique of<br /> Naïve Bayes is based on Bayes theorem and<br /> is particularly suitable for the cases whose<br /> input size is large. Although Naïve Bayes is<br /> 133<br /> <br /> Trịnh Minh Đức<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> quite simple, its classification capability is<br /> much better than other complex methods. Due<br /> to the statistically dependent relaxation<br /> hypotheses, Naïve Bayes method considers<br /> the attributes conditionally independent with<br /> each other.<br /> <br /> 118(04): 133 - 137<br /> <br /> Naïve Bayes classifying algorithm can be<br /> described succinctly as follows:<br /> The learning phase (given a training set). For<br /> each classification (i.e., class label) ci ∈ C<br /> <br /> Classification problem<br /> <br /> estimate the priori probability P( ci). For each<br /> <br /> A training set D, where each training instance<br /> x is represented as a n-dimensional attribute<br /> <br /> attribute value xj, estimate the probability of<br /> <br /> <br /> <br /> vector x = (x1, x2, ..., xn). A pre-defined set of<br /> classes C={c1, c2 ..., cm}.Given a new instance<br /> z, which class should z be classified into?<br /> Probability P(ck | z) is called the probability<br /> that a new instance z likely belonging to the<br /> class ck is calculated as follows:<br /> <br /> P(xj|ci)<br /> The classification phase (given a new<br /> instance). For each classification ci ∈ C,<br /> compute the fomula:<br /> n<br /> <br /> P(ci ).∏ P(x j | ci )<br /> <br /> c = arg max P(ci | z)<br /> ci∈C<br /> <br /> j=1<br /> <br /> c = arg max P(ci | z1 , z 2 ,..., z n )<br /> <br /> Select the most probable classification c*<br /> <br /> ci∈C<br /> <br /> P(z1 , z 2 ,..., z n | ci ).P(ci )<br /> c = arg max<br /> P(z1 , z 2 ,..., z n )<br /> ci∈C<br /> ci∈C<br /> <br /> P(z1 , z 2 ,..., z n | ci ).P(ci )<br /> ( P(z1 , z 2 ,..., z n ) is the same for all classes)<br /> Assumption in Naïve Bayes method: The<br /> attributes are conditionally independent given<br /> classification:<br /> n<br /> <br /> P(z1 , z 2 ,..., z n | ci ) = ∏ P(z j | ci )<br /> j=1<br /> <br /> Naïve Bayes classifier finds the most<br /> probable class for z:<br /> <br /> ci∈C<br /> <br /> j=1<br /> <br /> j<br /> <br /> | ci )<br /> <br /> ∏ P(z<br /> <br /> cNB = arg max P(ci ).<br /> <br /> j=1<br /> <br /> 1. What happens if no training instances<br /> associated with class ci have attribute value<br /> xj?<br /> P( xj | ci ) = 0, and hence:<br /> n<br /> <br /> P(ci ).∏ P(x j | ci ) = 0<br /> j=1<br /> <br /> Solution: to use a Bayesian approach to<br /> estimate P( xj | ci )<br /> <br /> P(x j | ci ) =<br /> <br /> n(ci , x j ) + mp<br /> n(ci ) + m<br /> <br />  n(ci ) : number of training instances<br /> <br /> n<br /> <br /> ci∈C<br /> <br /> n<br /> <br /> ∏ P(x<br /> <br /> c* = arg max P(ci ).<br /> <br /> There are two issues we need to solve:<br /> <br /> c= arg max<br /> <br /> 134<br /> <br /> that attribute value given classification ci:<br /> <br /> j<br /> <br /> | ci )<br /> <br /> associated with class ci<br /> <br /> Trịnh Minh Đức<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> 118(04): 133 - 137<br /> <br />  n(ci , x j ) : number of training instances<br /> <br /> efficiency of the classification. An email<br /> <br /> associated with class ci that have attribute<br /> <br /> consists of<br /> <br /> value xj.<br /> <br /> title, content, attachment or non… A simple<br /> <br />  p: a prior estimate for P(x j | ci ) →<br /> <br /> Assume uniform priors: p =<br /> <br /> 1<br /> , if attribute X<br /> k<br /> <br /> has k possible values<br />  m: a weight given to prior → To augment<br /> the n(ci) actual observations by an additional<br /> m virtual samples distributed according to p.<br /> <br /> a lot of charateristics such as:<br /> <br /> example: if we know that 95% HTML emails<br /> is spam, and we receive a HTML email, thus<br /> being able to base oneself on this prior<br /> probability<br /> <br /> in<br /> <br /> order<br /> <br /> to<br /> <br /> compute<br /> <br /> the<br /> <br /> probability of email that we receive is spam,<br /> if this probability is greater than the<br /> <br /> 2. The limit of precision in computers’<br /> <br /> probability given non-spam, it can be<br /> <br /> computing capability<br /> <br /> concluded that the email is spam, however,<br /> <br /> P(xj | ci) < 1, for every attribute value xj and<br /> <br /> this conclusion is not very accurate. However,<br /> <br /> class ci. So, when the number of attribute<br /> <br /> the more if we know much information, the<br /> <br /> values is very large<br /> <br /> greater the probability of correct classification<br /> is. To obtain prior probabilities, using Naïve<br /> Bayes method to train the set of early<br /> <br /> Solution: to use a logarithmic function of<br /> <br /> template emails, then using these probabilities<br /> to classify a new email. The probability<br /> <br /> probability<br /> c* =<br /> <br /> arg max<br /> ci∈C<br /> <br /> calculation will be based on Naïve Bayes<br /> formula. With the obtained probability values,<br /> we compare them with each other. If spam<br /> <br /> In the spam email classification problem,<br /> <br /> probability is greater than non-spam, then we<br /> <br /> each sample that we consider is an email. The<br /> <br /> can conclude that the email is spam, the<br /> <br /> set of classes that each email can belong to<br /> <br /> opposite is non-spam. [5]<br /> <br /> C={ spam, non-spam}.<br /> <br /> EXPERIMENTAL RESULTS<br /> <br /> When we receive an email, if we do not know<br /> <br /> We have implemented a test which applied<br /> <br /> any information about it, it’s so hard to decide<br /> <br /> Naïve Bayes method in email classification.<br /> <br /> exactly this email is spam or non-spam. If we<br /> <br /> The total number of emails in the sample<br /> <br /> have more certain characteristics or attributes<br /> <br /> dataset is 4601, including 1813 spam emails<br /> <br /> of<br /> <br /> (accounting for 39.4%). This dataset which can<br /> <br /> an email, then we can improve the<br /> <br /> 135<br /> <br /> Trịnh Minh Đức<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> 118(04): 133 - 137<br /> <br /> be downloaded at http://archive.ics.uci.edu/ml/<br /> <br />  n N −> N is the number of non-spam emails<br /> <br /> datasets/Spambase<br /> <br /> which the filter recognizes as non-spam<br />  N N is the total number of non-spam<br /> <br /> is called Spambase.<br /> <br /> This dataset is divided into two disjoint<br /> subsets: the training set D_train (accounting<br /> for 66.7%) – for training the system and the<br /> test set D_test (33.3%) – for evaluating the<br /> trained system.<br /> In order to evaluate a machine learning<br /> system’s performance, we often use some<br /> measures such as: Precision (P), Recall (R),<br /> Accuracy rate (Acc), Error rate (Err), F1measure.<br /> <br /> emails<br />  NS is the total number of spam emails<br /> Experimental results on the Spambase<br /> dataset<br /> We present test results for two options of the<br /> division of the Spambase dataset:<br /> Experiment 1: divide the original Spambase<br /> dataset with a proportion k1 =<br /> <br /> 2<br /> 2<br /> , in that,<br /> 3<br /> 3<br /> <br /> the dataset for training and the remaining for<br /> testing.<br /> Experiment 2: divide the original Spambase<br /> dataset with a proportion k2 =<br /> <br /> Formulas to compute these measures as<br /> follows:<br /> <br /> n S −>S<br /> n S −>S + n N −>S<br /> <br /> P=<br /> <br /> n S−>S<br /> n S−>S + n S−> N<br /> <br /> R=<br /> <br /> Acc =<br /> <br /> Err =<br /> <br /> F1 =<br /> <br /> n N −> N + n S−>S<br /> N N + NS<br /> n N −>S + n S−> N<br /> N N + NS<br /> <br /> 2.P.R<br /> 2<br /> =<br /> P+R 1 + 1<br /> P R<br /> <br /> 9<br /> , in that,<br /> 10<br /> <br /> 9<br /> the dataset for training and the remaining<br /> 10<br /> for testing.<br /> Table 1. Testing results<br /> <br /> n S−>S<br /> n S−> N<br /> n N −> N<br /> n N −>S<br /> Recall<br /> Precison<br /> Acc<br /> Err<br /> F1-measure<br /> <br /> Experiment 1<br /> 486<br /> <br /> Experiment 2<br /> 180<br /> <br /> 119<br /> <br /> 2<br /> <br /> 726<br /> <br /> 276<br /> <br /> 204<br /> <br /> 3<br /> <br /> 80.33%<br /> 70.43%<br /> 78.96%<br /> 21.04%<br /> 75.05%<br /> <br /> 98.90%<br /> 98.36%<br /> 98.92%<br /> 1.18%<br /> 98.63%<br /> <br /> Where:<br /> <br /> The testing result in this experiment 2 have<br /> very high accuracy (approximately 99%).<br /> <br />  n S −>S is the number of spam emails which<br /> <br /> Conclusion<br /> <br /> the filter recognizes as spam<br />  n S −> N is the number of spam emails<br /> <br /> In this paper, we have examined the effect of<br /> <br /> which the filter recognizes as non-spam<br />  n N −>S is the number of non-spam emails<br /> <br /> a classifier which has a self-learning ability to<br /> <br /> which the filter recognizes as spam<br /> 136<br /> <br /> the Naïve Bayes classification method. This is<br /> improve classification performance. Naïve<br /> <br /> Trịnh Minh Đức<br /> <br /> Tạp chí KHOA HỌC & CÔNG NGHỆ<br /> <br /> 118(04): 133 - 137<br /> <br /> Bayes classifier proved suitable for email<br /> <br /> REFERENCES<br /> <br /> classification problem. Currently, we are<br /> <br /> [1]. Jonathan A.Zdziarski, Ending Spam: Bayesian<br /> Content Filtering and the Art of Statistical<br /> Language Classification - Press 2006.<br /> [2]. Mehran Sahami. Susan Dumais. David<br /> Heckerman. Eric Horvitz., A Bayesian Approach<br /> to Filtering Junk E-Mail.<br /> [3]. Sun Microsystem, JavaMail API Design<br /> Specification Version 1.4.<br /> [4]. T. M. Mitchell. Machine Learning. McGrawHill, 1997.<br /> [5]. Lê Nguyễn Bá Duy , Tìm hiểu các hướng<br /> tiếp cận phân loại email và xây dựng phần mềm<br /> mail client hỗ trợ tiếng việt, Đại học khoa học tự<br /> nhiên, Tp.Hồ Chí Minh, 2005.<br /> <br /> continuing to build standard as well as<br /> training samples and can adjust the Naïve<br /> Bayes algorithm to improve the accuracy of<br /> classification. In the near future, we will build<br /> a standard training and testing data system for<br /> both English and Vienamese. This is a big<br /> problem and need to focus more effort.<br /> <br /> TÓM TẮT<br /> LỌC THƯ RÁC DỰA TRÊN HỌC MÁY<br /> Trần Minh Đức*<br /> Trường ĐH Công nghệ thông tin và Truyền thông – ĐH Thái Nguyên<br /> <br /> Trong bài báo này tôi giới thiệu phương pháp phân loại thư rác dựa trên học máy, vì cách tiếp cận<br /> này có hiệu quả cao. Với khả năng học (tự cải thiện hiệu năng), thì hệ thống có thể tự động học và<br /> cải thiện được hiệu quả phân loại thư rác. Đồng thời, khả năng phân loại của hệ thống cũng sẽ liên<br /> tục được cập nhật theo những mẫu thư rác mới và vì vậ y, sẽ rất khó để các spammers vượt qua<br /> được, so với các cách tiếp cận truyền thống khác.<br /> Từ khóa: Học máy, lọc thư rác, Naïve Bayes.<br /> <br /> Ngày nhận bài: 13/3/2014; Ngày phản biện: 15/3/2014; Ngày duyệt đăng: 25/3/2014<br /> Phản biện khoa học: TS. Trương Hà Hải – Trường ĐH CNTT&TT – ĐH Thái Nguyên<br /> *<br /> <br /> Tel: 0984215060; Email: duchoak15@gmail.com<br /> <br /> 137<br /> <br />
ADSENSE

CÓ THỂ BẠN MUỐN DOWNLOAD

 

Đồng bộ tài khoản
2=>2