PHISHING ATTACKS DETECTION USING GENETIC PROGRAMMING WITH FEATURES SELECTION
Tuan Anh Pham1, Thi Huong Chu2, Hoang Quan Nguyen2, Quang Uy Nguyen2, Xuan Hoai Nguyen3, Van Truong Nguyen4
1Centre of IT, Military Academy of Logistics, Vietnam
2The Faculty of Information Technology, Le Quy Don University, Vietnam
3IT R&D Center, Hanoi University, Vietnam
4College of Education, TNU, Vietnam
Tel: 0915 016063, Email: nvtruongtn@gmail.com
SUMMARY
Phishing is a real threat on the Internet nowadays. Therefore, fighting against phishing attacks is of great importance. In this paper, we propose a solution to this problem by applying Genetic Programming with feature selection methods to the phishing detection problem. We conducted experiments on a data set including both phishing and legitimate sites collected from the Internet. We compared the performance of Genetic Programming with a number of other machine learning techniques, and the results showed that Genetic Programming produced the best solutions to the phishing detection problem.
Keywords: Genetic Programming, Phishing Attack, Machine Learning
INTRODUCTION
Genetic Programming (GP) [2] is an evolutionary algorithm that aims to provide solutions to a user-defined task in the form of computer programs. Since its introduction, GP has been applied to many practical problems [2]. GP has also been used as a learning tool for solving some problems in network security [3]. However, to the best of our knowledge, there has not been any published work on the use of GP for learning to detect phishing web sites except our preliminary work in [4].
In the field of network security, phishing is one of the main threats on the Internet nowadays [5]. Phishing attackers attempt to acquire confidential information such as usernames, passwords, and credit card details by disguising themselves as a trustworthy entity in an online communication [5]. Due to their simplicity, phishing attacks are very popular. According to a report released by the American security firm RSA, there were approximately 33,000 phishing attacks globally each month in 2012, leading to a loss of $687 million [1]. Therefore, detecting and
eliminating phishing attacks is very important not only for organizations but also for individuals. One popular and widely-used solution is to integrate lists of blacklisted sites into web browsers. However, this solution, which cannot detect a new attack if the database is out of date, appears to be ineffective when a large number of phishing attacks are carried out every day.
In a recent study [4], Pham et al. proposed a solution to this problem by applying Genetic Programming to the phishing detection problem. The results showed that GP outperforms some other machine learning methods on this important problem. However, the research in [4] has some drawbacks.
1) The data set for training and testing was rather small. Therefore, the models created from this data set may not generalize well in a real environment.
2) More importantly, the number of features used in [4] was rather limited. Moreover, some features may not be relevant for distinguishing between phishing and legitimate sites. This may hinder the performance of machine learning methods in solving this problem.
In this paper, we extend the work in [4] in several ways. The main contributions of this paper are:
1) We enlarged both the training and testing data sets by collecting more phishing and legitimate sites from the Internet.
2) We enriched the feature set by adding some intuitive features which may be beneficial for discriminating between normal and phishing sites.
3) We used a feature selection method to eliminate some irrelevant features, which helps to improve the performance of Genetic Programming.
The remainder of the paper is organized as follows. In the next section, we briefly review some previous research on detecting phishing attacks. In Section III we present our method using GP for solving the phishing detection problem. It is followed by a section detailing our experimental settings. The experimental results are shown and discussed in Section V. The last section concludes the paper and highlights some potential future work.
RELATED WORKS
Since phishing attacks are very popular, a number of anti-phishing solutions have been proposed to date. Some methods aim to solve the phishing problem at the email level by preventing users from visiting the phishing sites. That is, the emails containing links to phishing sites are filtered before they can reach the potential victims. These techniques are closely related to anti-spam research and have been used by both Microsoft [6] and Yahoo [7]. Other solutions attempt to protect valuable information from being exposed to the phishers by replacing passwords with site-specific tokens, or by using novel authentication mechanisms. These methods have been used in some popular anti-phishing tools such as PwdHash and AntiPhish.
In PwdHash [8], a domain-specific password is created that is rendered useless if it is submitted to
another domain (e.g., a password for www.gmail.com will be different if submitted to www.attacker.com). AntiPhish [9] takes a different approach by keeping track of where confidential information such as a password is being submitted: if it detects that a password is being entered into a form on an untrusted web site, a warning is generated and the current operation is canceled.
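To illustrate the domain-specific password idea, the sketch below derives a per-site password from a master password and the domain it is submitted to, so a credential captured on one domain is useless on another. This is only a minimal sketch: the function name and the HMAC-SHA256 construction are our own assumptions and do not reproduce PwdHash's actual algorithm.

```python
import hashlib
import hmac

def domain_password(master_password, domain):
    # Derive a per-domain password; a value phished on www.attacker.com
    # is useless on www.gmail.com. Illustrative construction only.
    digest = hmac.new(master_password.encode(), domain.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# The same master password yields different credentials per domain:
print(domain_password("s3cret", "www.gmail.com"))
print(domain_password("s3cret", "www.attacker.com"))
```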
In this paper, we focus on approaches that only use the information available from the URL and the page's source code. Currently, there are two main such approaches for identifying phishing pages: those based on URL blacklists, and those based on the properties of the page and (sometimes) the URL. A more detailed description of these methods can be found in [4].
METHODS
This section presents the methods used in this paper. The way the features are extracted for each web site is presented first. The method for feature selection is discussed after that. Finally, the GP system for phishing detection is described.
Features Extraction
The first step in using GP to tackle the phishing detection problem is feature extraction/selection. The extracted features must contain information that helps to distinguish phishing and legitimate sites. In this paper, we extend the feature set in [4] by adding some more features that are based on the URL of the sites. In total, eighteen features are used in this paper, including the twelve content-based features that were used in [4] and six new URL-based features. These six URL-based features are taken from [10] and are described as follows.
• URL1: number of '@' in URL (X13).
• URL2: number of '-' in URL (X14).
• URL3: number of '.' in URL (X15).
• URL4: number of '.' in URL (X16).
• URL5: 1 if the URL contains the word 'ebayisapi', otherwise 0 (X17).
• URL6: 1 if the URL contains the word 'banking', otherwise 0 (X18).
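A minimal sketch of how these six URL-based features could be computed is given below. The exact counting rules in [10] may differ, the helper name url_features is ours, and URL3 and URL4 are listed with the same description in the text, so the sketch simply repeats the count.

```python
def url_features(url):
    # Lexical URL features X13-X18 as listed above (illustrative sketch only).
    u = url.lower()
    return [
        url.count('@'),                      # X13: number of '@' in URL
        url.count('-'),                      # X14: number of '-' in URL
        url.count('.'),                      # X15: number of '.' in URL
        url.count('.'),                      # X16: as listed above
        1 if 'ebayisapi' in u else 0,        # X17: URL contains 'ebayisapi'
        1 if 'banking' in u else 0,          # X18: URL contains 'banking'
    ]

print(url_features("http://online-banking.example.com/ebayisapi/update@verify"))
```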
Features Selection
Feature selection is the process of choosing a subset of features relevant to a particular application [11]. A number of feature selection methods have been proposed for machine learning algorithms [12]. Among them, statistics-based methods have shown good performance on a number of problems [12]. In this paper, we use the mutual information (MI) concept as the feature selection criterion.
Mutual information is a basic concept in information theory. It is a measure of the general interdependence between random variables [12]. Specifically, given two random variables X and Y, the mutual information I(X;Y) is defined as follows:
I(X;Y) = H(X) + H(Y) − H(X,Y)   (1)
where H(·) is the entropy of a random variable, which measures the uncertainty associated with it. If X is a discrete random variable, H(X) is defined as follows:
H(X) = −Σ_x P(x) log2 P(x)   (2)
Calculating the mutual information between two random variables exactly is not a straightforward task. Therefore, this value often has to be estimated. In this paper, we estimate MI using the histogram approach [12]. According to this method, the probability density function of each variable is approximated using a histogram. Then, the MI can be calculated according to the following equation:
<br />
I ( X ;Y ) P( X , Y )log 2<br />
x<br />
<br />
y<br />
<br />
P( X , Y )<br />
P( X ) P(Y )<br />
<br />
(3)<br />
<br />
where the summations are calculated over the appropriately discretized values of the random variables X and Y. For each histogram bin, the joint probability
distribution P(X,Y) is estimated by counting the number of cases that fall into a particular bin and dividing that number by the total number of cases. The same technique is applied for the histogram approximation of the marginal distributions P(X) and P(Y). Choosing an appropriate number of bins is a crucial issue. In this paper, we follow [19] in choosing the number of bins based on the Gaussianity rule: with Gaussian data, the proper number of bins is log2(N) + 1.
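Equation (3) together with the binning rule above can be estimated in a few lines. The following sketch assumes NumPy and continuous-valued samples; the function name is ours.

```python
import numpy as np

def mutual_information(x, y):
    # Histogram estimate of I(X;Y) from equation (3); the number of bins
    # follows the Gaussianity rule log2(N) + 1.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    bins = int(np.log2(len(x)) + 1)
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                # joint distribution P(X, Y)
    px = pxy.sum(axis=1, keepdims=True)      # marginal P(X)
    py = pxy.sum(axis=0, keepdims=True)      # marginal P(Y)
    nonzero = pxy > 0                        # skip empty bins to avoid log2(0)
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))
```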
System Description
The evolutionary learning process of GP for solving the phishing detection problem is divided into two stages: training and testing. The objective of the training stage is to evolve a model (a classifier) that can determine whether a site is phishing or legitimate based on its feature values. In the testing stage, the learnt model is used to make predictions on unseen data. The accuracy of this prediction is used as an indicator of the quality (effectiveness) of the model.
In the training stage, a set of training sites (both phishing and benign) with their labels (either phishing or normal) is provided. The feature extraction process converts every site into a feature vector. This vector then serves as the input to a GP individual, and the output of the individual is a real value. If this real value is greater than zero, the site is tagged as phishing; otherwise it is considered benign.
The next step in the training process is to measure the fitness of an individual in GP. In this paper, we use a simple fitness measure: the percentage of sites in the training set that are correctly classified. Although this measure may not be a good indicator if the data is imbalanced, it is an intuitive way to assess the overall quality of a model.
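A minimal sketch of this training-stage evaluation is shown below. It assumes each GP individual can be called as a function mapping a feature vector to a real value; the helper names are ours.

```python
def classify(individual, features):
    # Output greater than zero -> phishing (1); otherwise benign (0).
    return 1 if individual(features) > 0 else 0

def fitness(individual, training_set):
    # Fitness = percentage of training sites classified correctly.
    correct = sum(1 for features, label in training_set
                  if classify(individual, features) == label)
    return 100.0 * correct / len(training_set)
```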
EXPERIMENTAL SETTINGS
This section outlines the settings used in our experiments. First, we present the way that
data was collected for training and testing the systems. After that, the GP configurations for the experiments are described.
Data Collection
The data used for training and testing the system in this paper was collected from both phishing sites and legitimate sites on the Internet. The process is similar to that in [4], except that the number of pages is larger. In this paper, we collected 3528 phishing pages and 3965 normal pages.
From this data set, eighteen properties of each page were extracted to create the set of feature vectors. In case of duplication in the data set, only one feature vector was retained. Moreover, if a feature vector appeared in both the phishing data and the legitimate data, it was removed. As a result, 1800 feature vectors for phishing and 1200 feature vectors for legitimate data were obtained, giving 1800+1200=3000 feature vectors in total. These vectors were mixed and divided into two sets: one for training (1000 samples) and one for testing (the rest). Finally, the feature values were normalized to the range (0, 1), and the vectors extracted from phishing pages were labeled 1, the others 0.
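The cleaning, labeling, normalization and splitting steps described above could be sketched as follows. The function name and the fixed random seed are our own choices; the training-set size of 1000 comes from the text.

```python
import random

def prepare(phishing_vectors, legitimate_vectors, n_train=1000, seed=0):
    # Deduplicate within each class and drop vectors that appear in both classes.
    phishing = set(map(tuple, phishing_vectors))
    legitimate = set(map(tuple, legitimate_vectors))
    ambiguous = phishing & legitimate
    samples = ([(v, 1) for v in phishing - ambiguous] +     # phishing labeled 1
               [(v, 0) for v in legitimate - ambiguous])    # legitimate labeled 0
    # Min-max normalize every feature to the (0, 1) range.
    columns = list(zip(*(v for v, _ in samples)))
    lo, hi = [min(c) for c in columns], [max(c) for c in columns]
    samples = [([(x - l) / (h - l) if h > l else 0.0
                 for x, l, h in zip(v, lo, hi)], y) for v, y in samples]
    # Mix the vectors and split into training and testing sets.
    random.Random(seed).shuffle(samples)
    return samples[:n_train], samples[n_train:]
```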
GP Parameters Settings
To tackle a problem with GP, several elements need to be specified beforehand. These elements often depend on the problem and the experience of practitioners. The first and most important element is the fitness function. As mentioned above, in this paper we use the percentage of correct classifications as the fitness measure for each individual in the population. Other factors that strongly affect the performance of GP are the sets of non-terminals and terminals. The terminal set includes 18 variables (X1, X2, ..., X18) representing the 18 features extracted from the sites. The non-terminal set includes 5 functions (+, -, *, /, iff). Here, we used the protected
version of division (/), meaning that if the denominator is zero, the returned value is 1. Other evolutionary parameters are kept the same as in [2].
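For concreteness, the protected division could be implemented as below. The iff primitive is also sketched, but its arity and semantics are not specified in the text, so the four-argument form here is purely an assumption.

```python
def protected_div(a, b):
    # Protected division: return 1 when the denominator is zero.
    return a / b if b != 0 else 1.0

def iff(a, b, c, d):
    # Conditional primitive (assumed form): return c if a > b, else d.
    # The paper does not define the arity of iff, so this is only a guess.
    return c if a > b else d
```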
We divided our experiments into three sets. In the first, we repeated the experiments in [4], meaning that we used only the twelve features X1 to X12. However, the data sets for both training and testing in this experiment are much larger than those in [4]: we used 1000 samples for training and 2000 for testing (compared with only 516 training and 288 testing samples in [4]). The objective of this experiment is to see whether the performance of GP is maintained on a larger data set.
In the second set, we aimed to examine the impact of enriching the feature set on the performance of GP in the phishing detection problem. Similar to the experiments in [4], we also compared the performance of GP with several well-known machine learning techniques, including Support Vector Machines, Artificial Neural Networks and Bayesian Networks.
In the third set, we investigated the impact of the feature selection scheme based on mutual information on the performance of all tested machine learning methods. This experimental set aims to see whether the feature selection method helps to remove some irrelevant features, leading to better performance of the learning methods. The details of these experiments are presented in the following section.
RESULTS AND DISCUSSION
To determine the quality of the models produced by GP, at the end of each run we selected the best-of-the-run individual (the individual with the best fitness on the training set in the entire run). This model is then evaluated on the testing set, and its result on the testing set is taken as the prediction performance of the model. To apply the other machine learning techniques to this problem, we
used their implementations in Weka. We compare the results produced by these methods with the results obtained by GP. The percentage of correct predictions of these methods in the three experiments (Exp) is presented in Table 1. In this table, GP denotes the results produced by Genetic Programming, SVM is short for Support Vector Machine, and ANN stands for Artificial Neural Network. It should be noted that greater values are better.
It can be seen that the results in Table 1 are consistent with the results in [4]. They confirm that the best model produced by GP is also the best among all models produced by the learning systems tested. Overall, the prediction accuracy of the GP-learnt model is about 71% in the first experiment. The corresponding values of the other methods range from about 54% to 68%, with the lowest value obtained by SVM and the highest by ANN.
Table 1. The Percentage of Correct Prediction

Exp     GP      SVM     ANN     BayesNet
Exp1    71.6    54.3    68.2    63.6
Exp2    76.3    56.5    74.2    73.1
Exp3    78.8    58.1    73.2    73.6
The second experimental set was aimed at testing whether adding more features (based on the URL) to the feature set improves the performance of these learning methods on this problem. The results of the second experiment are presented in the second row of Table 1.
It can be seen that by enriching the feature set, the performance of almost all learning methods was improved. The most remarkable improvement is achieved with ANN and BayesNet: the accuracy of these two methods increased to around 74%. With SVM, the performance was also enhanced, from 54% to around 57%. More importantly, the performance of GP also improved, and it still obtained the best results among all tested techniques. The
results obtained by GP with this feature set are about 76%. In general, the results of this experiment show the benefit of adding URL-based features to the feature set for this problem.
The results of the second experimental set show that enriching the feature set helps to improve the performance of learning algorithms on the phishing detection problem. However, this larger feature set may also contain some irrelevant features that might hinder the performance of GP and the other learning methods. Therefore, the third experimental set examines whether the feature selection method based on mutual information helps to eliminate irrelevant features, leading to better performance. We first calculated the mutual information between each feature and the label over the whole data set (including both the training and the testing set). After that, we sorted the features in ascending order of their mutual information with the label. We omitted X8, X17 and X18 from the feature set because they are only loosely related to the label, and we repeated the above experiments with the new feature set. The results are given in the third row of Table 1. It can be seen from these results that by using the feature selection technique to eliminate some irrelevant features (X8, X17 and X18 in this paper), we can achieve better performance with GP. While the performance of the other learning algorithms is mostly the same as in the second experimental set, the performance of GP continues to improve, and it obtains the best result of all experiments, at about 78%. Overall, the experiments in this paper demonstrate the ability of GP to tackle the phishing detection problem: if we enrich the feature set and use feature selection to eliminate irrelevant features, we can achieve a rather good result of approximately 80% correct predictions. Compared to the best result in [4] of only about 70%, this is a significant improvement.
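The ranking-and-removal procedure described above could be sketched as follows, reusing the mutual_information estimator sketched in the Features Selection subsection; dropping the three lowest-ranked features mirrors the choice of omitting X8, X17 and X18, and the helper names are ours.

```python
def rank_by_mutual_information(vectors, labels):
    # Rank feature indices by mutual information with the class label, lowest first.
    columns = list(zip(*vectors))
    scores = sorted((mutual_information(col, labels), i) for i, col in enumerate(columns))
    return [i for _, i in scores]

def drop_weakest_features(vectors, labels, n_drop=3):
    # Remove the n_drop features least related to the label (X8, X17, X18 here).
    weakest = set(rank_by_mutual_information(vectors, labels)[:n_drop])
    return [[x for i, x in enumerate(v) if i not in weakest] for v in vectors]
```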