SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trinh Minh Duc*
College of Information and Communication Technology – TNU
SUMMARY
In this paper, we present a spam email filtering method based on machine learning, namely the Naïve Bayes classification method, because this approach is highly effective. Thanks to its learning ability (self-improving performance), a system applying this method can automatically learn and improve the effectiveness of spam email classification. At the same time, the system's classification ability is continuously updated by new incoming emails; it is therefore much harder for spammers to defeat the classifier than with traditional solutions.
Key words: Machine learning, email spam filtering, Naïve Bayes.
INTRODUCTION*
Email classification is in essence a two-class text classification problem: the initial dataset consists of spam and non-spam emails, and the texts to be classified are the emails arriving in the inbox. The output of the classification process is the class label assigned to an email, that is, one of the two classes: spam or non-spam.
The general model of the spam email classification problem is described in Figure 1. The categorization process can be divided into two phases, illustrated by the sketch below:
The training phase: the input of this phase is the set of spam and non-spam emails; the output is the trained data, produced by applying a suitable classification method, which serves the classification phase.
The classification phase: the input of this phase is an email together with the trained data; the output is the classification result for the email: spam or non-spam.
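To make the two phases concrete, here is a minimal sketch in Python; it is our own illustration, not the paper's system, and the function names and the word-set feature representation are assumptions:

# Minimal two-phase pipeline sketch. extract_features, train_filter and
# classify_email are hypothetical names used only for illustration.

def extract_features(email_text):
    # Represent an email by the set of lower-cased words it contains.
    # A real filter may also use the title, HTML markers, attachments, etc.
    return set(email_text.lower().split())

def train_filter(spam_emails, non_spam_emails):
    # Training phase: input is the set of spam and non-spam emails;
    # output is the trained data used later by the classification phase.
    return {
        "spam": [extract_features(e) for e in spam_emails],
        "non-spam": [extract_features(e) for e in non_spam_emails],
    }

def classify_email(trained_data, email_text):
    # Classification phase: input is an email plus the trained data;
    # output is 'spam' or 'non-spam'. The concrete decision rule is
    # supplied by a classification method, e.g. Naive Bayes below.
    raise NotImplementedError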
The rest of this paper is organized as follows. In Sect. 2, we formulate the Naïve Bayes classification method and our solution. In Sect. 3, we show experimental results that evaluate the efficiency of this method. Finally, in Sect. 4, we conclude and point out possible future directions.
Figure 1. The spam email classification model

* Tel: 0984215060; Email: duchoak15@gmail.com

NAÏVE BAYES CLASSIFICATION METHOD [4]
Naïve Bayes is a supervised learning classification method based on a probability model: the classification process relies on the probability values of the likelihood of the hypotheses. The Naïve Bayes classification technique rests on Bayes' theorem and is particularly suitable for problems with high-dimensional input. Although Naïve Bayes is
quite simple, its classification capability is often much better than that of more complex methods. By relaxing the assumption of statistical dependence between attributes, the Naïve Bayes method treats the attributes as conditionally independent of each other.
Classification problem

Given: a training set D, in which each training instance x is represented as an n-dimensional attribute vector x = (x_1, x_2, ..., x_n), and a pre-defined set of classes C = {c_1, c_2, ..., c_m}. For a new instance z, which class should z be classified into?

The probability P(c_k | z) that a new instance z belongs to class c_k leads to the classification rule

c = \arg\max_{c_i \in C} P(c_i \mid z)

c = \arg\max_{c_i \in C} P(c_i \mid z_1, z_2, \dots, z_n)

c = \arg\max_{c_i \in C} \frac{P(z_1, z_2, \dots, z_n \mid c_i)\, P(c_i)}{P(z_1, z_2, \dots, z_n)}

c = \arg\max_{c_i \in C} P(z_1, z_2, \dots, z_n \mid c_i)\, P(c_i)

(the denominator P(z_1, z_2, ..., z_n) is the same for all classes).

Assumption of the Naïve Bayes method: the attributes are conditionally independent given the classification:

P(z_1, z_2, \dots, z_n \mid c_i) = \prod_{j=1}^{n} P(z_j \mid c_i)

The Naïve Bayes classifier therefore finds the most probable class for z:

c_{NB} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(z_j \mid c_i)

The Naïve Bayes classification algorithm can be described succinctly as follows:

The learning phase (given a training set). For each class label c_i ∈ C, estimate the prior probability P(c_i). For each attribute value x_j, estimate the probability of that attribute value given the class c_i: P(x_j | c_i).

The classification phase (given a new instance). For each class c_i ∈ C, compute

P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)

and select the most probable classification

c^{*} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)

There are two issues we need to solve:

1. What happens if no training instance associated with class c_i has attribute value x_j? Then P(x_j | c_i) = 0, and hence

P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i) = 0

Solution: use a Bayesian approach (the m-estimate) to estimate P(x_j | c_i):

P(x_j \mid c_i) = \frac{n(c_i, x_j) + m p}{n(c_i) + m}

where n(c_i) is the number of training instances associated with class c_i; n(c_i, x_j) is the number of training instances associated with class c_i that have attribute value x_j; p is a prior estimate for P(x_j | c_i) (assuming uniform priors, p = 1/k if the attribute X has k possible values); and m is a weight given to the prior, i.e., we augment the n(c_i) actual observations by an additional m virtual samples distributed according to p.
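Before turning to the second issue, here is a minimal sketch of the learning phase in Python, including the m-estimate above. The names, and the choice of binary word-presence attributes (so k = 2 and p = 1/2 as defaults), are our assumptions, not the paper's implementation:

from collections import Counter

def train_naive_bayes(emails, labels, m=2.0, p=0.5):
    # emails: list of attribute-value sets (e.g. the words present in each email)
    # labels: parallel list of class labels, 'spam' or 'non-spam'
    # m, p:   weight and prior of the m-estimate correction
    classes = set(labels)
    n_ci = Counter(labels)                     # n(ci): training instances per class
    n_ci_xj = {c: Counter() for c in classes}  # n(ci, xj): per-class attribute counts
    for features, label in zip(emails, labels):
        n_ci_xj[label].update(features)

    prior = {c: n_ci[c] / len(labels) for c in classes}  # P(ci)

    def likelihood(xj, ci):
        # P(xj | ci) = (n(ci, xj) + m*p) / (n(ci) + m); never zero thanks to m*p
        return (n_ci_xj[ci][xj] + m * p) / (n_ci[ci] + m)

    return prior, likelihood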
2. The limit of precision of computer arithmetic. P(x_j | c_i) < 1 for every attribute value x_j and class c_i, so when the number of attribute values is very large, the product of the probabilities underflows.

Solution: use a logarithmic function of the probability:

c^{*} = \arg\max_{c_i \in C} \left( \log P(c_i) + \sum_{j=1}^{n} \log P(x_j \mid c_i) \right)

In the spam email classification problem, each sample we consider is an email, and the set of classes an email can belong to is C = {spam, non-spam}. When we receive an email about which we know nothing, it is very hard to decide whether it is spam or non-spam; the more characteristics or attributes of the email we know, the more we can improve the efficiency of the classification. An email has many characteristics, such as its title, its content, and whether or not it has an attachment. A simple example: if we know that 95% of HTML emails are spam and we receive an HTML email, we can use this prior probability to compute the probability that the received email is spam; if that probability is greater than the probability that it is non-spam, we can conclude that the email is spam, although such a conclusion is not very accurate. However, the more information we have, the greater the probability of a correct classification. To obtain the prior probabilities, we use the Naïve Bayes method to train on the initial set of template emails, and then use these probabilities to classify a new email. The probability calculation is based on the Naïve Bayes formula. We then compare the obtained probability values with each other: if the spam probability is greater than the non-spam probability, we conclude that the email is spam; otherwise it is non-spam. [5]
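Putting the pieces together, a matching sketch of the classification phase computes the arg max in log space, avoiding the underflow of issue 2 above; it continues the hypothetical train_naive_bayes sketch:

import math

def classify(email_features, prior, likelihood):
    # Return arg max over ci of log P(ci) + sum_j log P(xj | ci),
    # which selects the same class as the product form above.
    best_class, best_score = None, float("-inf")
    for ci in prior:
        score = math.log(prior[ci])
        for xj in email_features:
            score += math.log(likelihood(xj, ci))  # safe: m-estimate keeps P > 0
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

For instance, prior, likelihood = train_naive_bayes(train_emails, train_labels) followed by classify(extract_features(new_email), prior, likelihood) would label a new email as 'spam' or 'non-spam', depending on the trained counts.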
EXPERIMENTAL RESULTS

We have implemented a test applying the Naïve Bayes method to email classification. The total number of emails in the sample dataset is 4601, including 1813 spam emails (39.4%). This dataset, called Spambase, can be downloaded at http://archive.ics.uci.edu/ml/datasets/Spambase.

The dataset is divided into two disjoint subsets: the training set D_train (66.7% of the data), used for training the system, and the test set D_test (33.3%), used for evaluating the trained system.

In order to evaluate a machine learning system's performance, we often use measures such as Precision (P), Recall (R), Accuracy rate (Acc), Error rate (Err), and the F1 measure, computed as follows:

P = \frac{n_{S \to S}}{n_{S \to S} + n_{N \to S}}

R = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to N}}

Acc = \frac{n_{N \to N} + n_{S \to S}}{N_N + N_S}

Err = \frac{n_{N \to S} + n_{S \to N}}{N_N + N_S}

F_1 = \frac{2 P R}{P + R} = \frac{2}{\frac{1}{P} + \frac{1}{R}}

where:
n_{S→S} is the number of spam emails which the filter recognizes as spam;
n_{S→N} is the number of spam emails which the filter recognizes as non-spam;
n_{N→S} is the number of non-spam emails which the filter recognizes as spam;
n_{N→N} is the number of non-spam emails which the filter recognizes as non-spam;
N_N is the total number of non-spam emails;
N_S is the total number of spam emails.

Experimental results on the Spambase dataset

We present test results for two ways of dividing the Spambase dataset:

Experiment 1: divide the original Spambase dataset with proportion k_1 = 2/3, that is, 2/3 of the dataset for training and the remainder for testing.

Experiment 2: divide the original Spambase dataset with proportion k_2 = 9/10, that is, 9/10 of the dataset for training and the remainder for testing.

Table 1. Testing results

                Experiment 1    Experiment 2
n_{S→S}         486             180
n_{S→N}         119             2
n_{N→N}         726             276
n_{N→S}         204             3
Recall          80.33%          98.90%
Precision       70.43%          98.36%
Acc             78.96%          98.92%
Err             21.04%          1.08%
F1 measure      75.05%          98.63%

The testing results in Experiment 2 show very high accuracy (approximately 99%).
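As a quick check of the formulas, the following Python sketch recomputes the measures from the Experiment 1 confusion counts of Table 1 (the function name is ours):

def evaluate(n_ss, n_sn, n_nn, n_ns):
    # n_ss, n_sn: spam emails classified as spam / as non-spam
    # n_nn, n_ns: non-spam emails classified as non-spam / as spam
    total = n_ss + n_sn + n_nn + n_ns          # N_S + N_N
    precision = n_ss / (n_ss + n_ns)
    recall = n_ss / (n_ss + n_sn)
    acc = (n_nn + n_ss) / total
    err = (n_ns + n_sn) / total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, acc, err, f1

# Experiment 1 column of Table 1:
print(evaluate(486, 119, 726, 204))
# approx. P = 70.43%, R = 80.33%, Acc = 78.96%, Err = 21.04%, F1 = 75.1%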
CONCLUSION

In this paper, we have examined the effect of a classifier that has a self-learning ability to improve classification performance: the Naïve Bayes classification method.
The Naïve Bayes classifier proved suitable for the email classification problem. Currently, we are continuing to build standard training samples and to adjust the Naïve Bayes algorithm to improve the accuracy of classification. In the near future, we will build a standard training and testing data system for both English and Vietnamese. This is a large problem and needs further concentrated effort.

REFERENCES
[1]. Jonathan A. Zdziarski, Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification, No Starch Press, 2005.
[2]. Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz, A Bayesian Approach to Filtering Junk E-Mail, AAAI Workshop on Learning for Text Categorization, 1998.
[3]. Sun Microsystems, JavaMail API Design Specification, Version 1.4.
[4]. T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[5]. Lê Nguyễn Bá Duy, A Study of Email Classification Approaches and Development of a Mail Client Supporting Vietnamese (in Vietnamese), University of Science, Ho Chi Minh City, 2005.
ABSTRACT
SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trịnh Minh Đức*
College of Information and Communication Technology – TNU

In this paper, we introduce a spam email filtering method based on machine learning, since this approach is highly effective. With its learning ability (self-improving performance), the system can automatically learn and improve the effectiveness of spam email classification. At the same time, the classification ability of the system is continuously updated with new spam samples; it is therefore very difficult for spammers to defeat it, compared with other traditional approaches.
Keywords: Machine learning, spam filtering, Naïve Bayes.

Received: 13/3/2014; Reviewed: 15/3/2014; Accepted for publication: 25/3/2014
Scientific reviewer: Dr. Trương Hà Hải – College of Information and Communication Technology – Thai Nguyen University