MINISTRY OF EDUCATION AND TRAINING

VIETNAMESE ACADEMY OF SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
————————————

NGUYEN VAN TRUONG

IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS FOR NETWORK INTRUSION DETECTION

THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN MATHEMATICS

Hanoi - 2019

MINISTRY OF EDUCATION AND TRAINING

VIETNAMESE ACADEMY OF SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
————————————

NGUYEN VAN TRUONG

IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS FOR NETWORK INTRUSION DETECTION

THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN MATHEMATICS

Major: Mathematical foundations for Informatics

Code: 62 46 01 10

Scientific supervisors:

1. Assoc. Prof., Dr. Nguyen Xuan Hoai

2. Assoc. Prof., Dr. Luong Chi Mai

Hanoi - 2019

Acknowledgments

First of all, I would like to thank my principal supervisor, Assoc. Prof., Dr. Nguyen Xuan Hoai, for introducing me to the field of Artificial Immune Systems. He guided me step by step through research activities such as seminar presentations and paper writing. His insight has been a constant source of help, and his constructive criticism has been invaluable throughout my PhD journey. I also wish to thank my co-supervisor, Assoc. Prof., Dr. Luong Chi Mai. She was always very enthusiastic in our discussions of promising research questions. It has been a pleasure and a privilege to work with her. This thesis would not have been possible without my supervisors' support.

I gratefully acknowledge the support from the Institute of Information Technology, Vietnamese Academy of Science and Technology, and from Thai Nguyen University of Education. I also thank the National Foundation for Science and Technology Development (NAFOSTED) and the ASEAN-European Academic University Network (ASEA-UNINET) for their financial support.

I thank M.Sc. Vu Duc Quang, M.Sc. Trinh Van Ha and M.Sc. Pham Dinh Lam, my co-authors of published papers. I thank Assoc. Prof., Dr. Tran Quang Anh and Dr. Nguyen Quang Uy for many helpful insights into my research. I thank my colleagues, especially my labmate Mr. Nguyen Tran Dinh Long, at the IT Research & Development Center, HaNoi University.

Finally, I thank my family for their endless love and steady support.

Certificate of Originality

I hereby declare that this submission is my own work, carried out under the guidance of my scientific supervisors, Assoc. Prof., Dr. Nguyen Xuan Hoai and Assoc. Prof., Dr. Luong Chi Mai. I declare that it contains no material previously published or written by another person, except where due reference is made in the text of the thesis. In addition, I certify that all my co-authors have allowed me to present our joint work in this thesis.

Hanoi, 2019

PhD. student

Nguyen Van Truong


Contents

List of Figures v

List of Tables vii

Notation and Abbreviation viii

INTRODUCTION 1

Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Problem statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1 BACKGROUND 5

1.1 Detection of Network Anomalies . . . . . . . . . . . . . . . . . . . . . . 5

1.1.1 Host-Based IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.1.2 Network-Based IDS . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.1.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.1.4 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.2 A brief overview of human immune system . . . . . . . . . . . . . . . . 8

1.3 AIS for IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.3.1 AIS model for IDS . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.3.2 AIS features for IDS . . . . . . . . . . . . . . . . . . . . . . . . 11

1.4 Selection algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.4.1 Negative Selection Algorithms . . . . . . . . . . . . . . . . . . . 12


1.4.2 Positive Selection Algorithms . . . . . . . . . . . . . . . . . . . 15

1.5 Basic terms and definitions . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.5.1 Strings, substrings and languages . . . . . . . . . . . . . . . . . 16

1.5.2 Prefix trees, prefix DAGs and automata . . . . . . . . . . . . . 17

1.5.3 Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.5.4 Detection in r-chunk detector-based positive selection . . . . . . 20

1.5.5 Holes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.5.6 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . 22

1.5.7 Ring representation of data . . . . . . . . . . . . . . . . . . . . 23

1.5.8 Frequency trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.6 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

1.6.1 The DARPA-Lincoln datasets . . . . . . . . . . . . . . . . . . . 27

1.6.2 UT dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.6.3 Netflow dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.6.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2 COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION 30

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.3 New Positive-Negative Selection Algorithm . . . . . . . . . . . . . . . . 31

2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3 GENERATION OF COMPACT DETECTOR SET 43

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.3 New negative selection algorithm . . . . . . . . . . . . . . . . . . . . . 45


3.3.1 Detectors set generation under rcbvl matching rule . . . . . . 45

3.3.2 Detection under rcbvl matching rule . . . . . . . . . . . . . . . 48

3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4 FAST SELECTION ALGORITHMS 51

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.3 A fast negative selection algorithm based on r-chunk detector . . . . . . 52

4.4 A fast negative selection algorithm based on r-contiguous detector . . . 57

4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5 APPLYING HYBRID ARTIFICIAL IMMUNE SYSTEM FOR NETWORK SECURITY 66

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.3 Hybrid positive selection algorithm with chunk detectors . . . . . . . . 69

5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.4.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.4.3 Performance metrics and parameters . . . . . . . . . . . . . . . 72

5.4.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

CONCLUSIONS 78

Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Published works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80


BIBLIOGRAPHY 81


List of Figures

1.1 Classification of anomaly-based intrusion detection methods . . . . . . 7

1.2 Multi-layered protection and elimination architecture . . . . . . . . . . 9

1.3 Multi-layer AIS model for IDS . . . . . . . . . . . . . . . . . . . . . . . 10

1.4 Outline of a typical negative selection algorithm. . . . . . . . . . . . . . 13

1.5 Outline of a typical positive selection algorithm. . . . . . . . . . . . . . 15

1.6 Example of a prefix tree and a prefix DAG. . . . . . . . . . . . . . . . . 18

1.7 Existence of holes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1.8 Negative selections with 3-chunk and 3-contiguous detectors. . . . . . . 23

1.9 A simple ring-based representation (b) of a string (a). . . . . . . . . . . 25

1.10 Frequency trees for all 3-chunk detectors. . . . . . . . . . . . . . . . . . 26

2.1 Binary tree representation of the detectors set generated from S. . . . . 33

2.2 Conversion of a positive tree to a negative one. . . . . . . . . . . . . . . 33

2.3 Diagram of the Detector Generation Algorithm. . . . . . . . . . . . . . 35

2.4 Diagram of the Positive-Negative Selection Algorithm. . . . . . . . . . 37

2.5 One node is reduced in a tree: a compact positive tree has 4 nodes (a) and its conversion (a negative tree) has 3 nodes (b). . . . . . 38

2.6 Detection time of NSA and PNSA. . . . . . . . . . . . . . . . . . . . . 40

2.7 Nodes reduction on trees created by PNSA on Netflow dataset. . . . . . 41

2.8 Comparison of nodes reduction on Spambase dataset. . . . . . . . . . . 41

3.1 Diagram of a algorithm to generate perfect rcbvl detectors set. . . . . . 47

4.1 Diagram of the algorithm to generate positive r-chunk detectors set. . . 55


4.2 A prefix DAG G and an automaton M . . . . . . . . . . . . . . . . . . 57

4.3 Diagram of the algorithm to generate negative r-contiguous detectors set. 61

4.4 An automaton represents 3-contiguous detectors set. . . . . . . . . . . . 62

4.5 Comparison of ratios of runtime of r-chunk detector-based NSA to runtime of Chunk-NSA . . . . . . . . . . 63

4.6 Comparison of ratios of runtime of r-contiguous detector-based NSA to runtime of Cont-NSA . . . . . . . . . . 64


List of Tables

1.1 Performance comparison of NSAs on linear strings and ring strings. . . 24

2.1 Comparison of memory and detection time reductions. . . . . . . . . . 39

2.2 Comparison of nodes generation on Netflow dataset. . . . . . . . . . . . 40

3.1 Data and parameters distribution for experiments and results comparison. 49

4.1 Comparison of our results with the runtimes of previously published

algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.2 Comparison of Chunk-NSA with r-chunk detector-based NSA. . . . . . 63

4.3 Comparison of proposed Cont-NSA with r-contiguous detector-based NSA. 64

5.1 Features for NIDS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.2 Distribution of flows and parameters for experiments. . . . . . . . . . . 73

5.3 Comparison between PSA2 and other algorithms. . . . . . . . . . . . . 74

5.4 Comparison between ring string-based PSA2 and linear string-based PSA2. 76


Notation and Abbreviation

Notation

ℓ                Length of data samples

Sr               Set of ring representations of all strings in S

|X|              Cardinality of set X

Σ                An alphabet, a nonempty and finite set of symbols

Σk               Set of all strings of length k on alphabet Σ, where k is a positive integer

Σ∗               Set of all strings on alphabet Σ, including the empty string

r                Matching threshold

Dpi              Set of all positive r-chunk detectors at position i

Dni              Set of all negative r-chunk detectors at position i

CHUNKp(S, r)     Set of all positive r-chunk detectors

CHUNK(S, r)      Set of all negative r-chunk detectors

CONT(S, r)       Set of all r-contiguous detectors

L(X)             Set of all nonself strings detected by X

rcbvl            r-contiguous bit with variable length

Abbreviation

AIS          Artificial Immune System

ACC          Accuracy Rate

ACO          Ant Colony Optimization

ANIDS        Anomaly Network Intrusion Detection System

BBNN         Block-Based Neural Network

Chunk-NSA    Chunk Detector-Based Negative Selection Algorithm

Cont-NSA     Contiguous Detector-Based Negative Selection Algorithm

DR           Detection Rate

DAG          Directed Acyclic Graph

FAR          False Alarm Rate

GA           Genetic Algorithm

HIS          Human Immune System

HIDS         Host Intrusion Detection System

IDS          Intrusion Detection System

ML           Machine Learning

MLP          Multilayer Perceptron

NIDS         Network Intrusion Detection System

NS           Negative Selection

NSA          Negative Selection Algorithm

NSM          Negative Selection Mutation

PNSA         Positive-Negative Selection Algorithm

PSA          Positive Selection Algorithm

PSA2         Two-class Positive Selection Algorithm

PSO          Particle Swarm Optimization

PSOGSA       Particle Swarm Optimization-Gravitational Search Algorithm

RNSA         Real-valued NSA

SVM          Support Vector Machines

TCP          Transmission Control Protocol

VNSA         Variable length detector-based NSA


INTRODUCTION

Motivation

Internet users and computer networks are suffering from a rapidly increasing number of attacks. In order to keep them safe, there is a need for effective security monitoring systems, such as Intrusion Detection Systems (IDS). However, intrusion detection has to face a number of different problems, such as large network traffic volumes, imbalanced data distribution, difficulty in realizing decision boundaries between normal and abnormal actions, and a requirement for continuous adaptation to a constantly changing environment. As a result, many researchers have attempted to use different types of approaches to build reliable intrusion detection systems.

Computational intelligence techniques, known for their ability to adapt and to exhibit fault tolerance, high computational speed and resilience against noisy information, are promising alternative approaches to this problem.

One of the promising computational intelligence methods for intrusion detection that has emerged recently is the artificial immune system (AIS), inspired by the biological immune system. The negative selection algorithm (NSA), a dominant model of AIS, is widely used for intrusion detection systems (IDS) [55, 52]. Despite its successful application, NSA has some weaknesses: 1) high false positive rate (false alarm rate) and false negative rate; 2) high training and testing time; 3) an exponential relationship between the size of the training data and the number of detectors possibly generated for testing; 4) changeable definitions of "normal data" and "abnormal data" in dynamic network environments [55, 79, 92]. To overcome these limitations, recent works concentrate on complex structures of immune detectors, matching methods and hybrid NSAs [11, 94, 52].

Following the trends mentioned above, in this thesis we investigate the ability of NSA to combine with other classification methods and propose more effective data representations to address some of NSA's weaknesses.

The scientific meaning of the thesis is to provide further background for improving the performance of the AIS-based computer security field in particular and of IDS in general.

The practical meaning of the thesis is to assist computer security practitioners and experts in implementing their IDSs with new AIS-derived features.

The major contributions of this research are: proposing a new representation of data for better IDS performance; proposing a combination of existing algorithms as well as some statistical approaches in a uniform framework; and proposing a complete and non-redundant detector representation to achieve optimal time and memory complexities.

Objectives

Since data representation is one of the factors that affect the training and testing

time, a compact and complete detector generation algorithm is investigated.

The thesis investigates optimal algorithms to generate detector sets in AIS. These help to reduce both the training time and the detection time of AIS-based IDSs.

The thesis also aims to propose and investigate an AIS-based IDS that can promptly detect attacks, whether they are known or never seen before. The proposed system makes use of AIS with statistics as analysis methods and flow-based network traffic as experimental data.

Problem statements

Since the NSA has some limitations as listed in the first section, this thesis

concentrates on three problems:

1. The first problem is to find compact representations of data. The objectives of this problem's solution are not only to minimize memory storage but also to reduce testing time.

2. The second problem is to propose algorithms that can reduce training time and testing time compared with all existing related algorithms.


3. The third problem is to improve detection performance by reducing false alarm rates while keeping the detection rate and accuracy rate as high as possible.

Solutions to these problems can partly mitigate the first three weaknesses listed in the first section. Regarding the last weakness of NSAs, the changeable definitions of "normal data" and "abnormal data" in dynamic network environments, we consider it a risk in our proposed algorithms and leave it for future work.

Logically, it is impossible to find a single optimal algorithm that both reduces time and memory complexities and obtains the best detection performance. These aspects are always in conflict with each other. Thus, in each chapter, we propose algorithms to solve each problem fairly independently.

The intrusion detection problem addressed in this thesis can be informally stated as follows: given a finite set S of network flows, each labeled self (normal) or nonself (abnormal), the objective is to build classifying models on S that can correctly label an unlabeled network flow s.
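The problem statement above amounts to two-class classification over labeled flows. As a purely illustrative sketch (the binary flow encoding, the lookup-table "model" and the default label are toy assumptions, not the algorithms proposed in the thesis), it can be written as:

```python
# Illustrative restatement of the problem: given a training set S of
# flows labeled self/nonself, build a model that labels an unseen flow s.
# The binary flow encoding, the lookup-table "model" and the default
# label are toy assumptions, not the algorithms proposed in the thesis.

def train(S):
    """S: list of (flow, label) pairs with label in {'self', 'nonself'}."""
    return {flow: label for flow, label in S}

def classify(model, s, default='nonself'):
    """Label an unseen network flow s; unknown flows default to nonself."""
    return model.get(s, default)

S = [('0110', 'self'), ('1010', 'self'), ('1111', 'nonself')]
model = train(S)
print(classify(model, '0110'))  # self
print(classify(model, '0001'))  # nonself: a flow never seen in training
```

A real IDS model must generalize beyond the training flows, which is exactly what the selection algorithms developed in later chapters provide; the lookup table above only states the input-output contract.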

Outline of thesis

The first chapter introduces the background knowledge necessary to discuss the algorithms proposed in the following chapters. First, detection of network anomalies is briefly introduced. Following that, the human immune system, artificial immune systems, machine learning, their relevance and related works are reviewed and discussed. Then, popular datasets used for experiments in the thesis are examined.

In Chapter 2, a combination method of selection algorithms is presented. The proposed technique helps to reduce the storage of detectors generated in the training phase. Testing time, an important measurement in IDS, is also reduced as a direct consequence of the smaller memory complexity. A tree structure is used in this chapter (and in Chapter 5) to improve time and memory complexities.

A complete and non-redundant detector set, also called a perfect detector set, is necessary to achieve acceptable self and nonself coverage of classifiers. A selection algorithm to generate a perfect detector set is investigated in Chapter 3. Each detector in the set is a string concatenated from overlapping classical ones. Unlike the approaches in the other chapters, the discrete structure of the string-based detectors in this chapter is suitable for detection in distributed environments.

Chapter 4 includes two selection algorithms for a fast training phase. These optimal algorithms can generate a detector set in linear time with respect to the size of the training data. The experimental results and theoretical proofs show that the proposed algorithms outperform all existing ones in terms of training time. In terms of detection time, the first algorithm is linear and the second is polynomial.

Chapter 5 mainly introduces a hybrid approach combining the positive selection algorithm with statistics for a more effective NIDS. Frequencies of self and nonself data (strings) are stored in the leaves of the trees representing detectors. This information plays an important role in improving the performance of the proposed algorithms. The hybrid approach yields a new positive selection algorithm for two-class classification that can be trained with samples from both self and nonself data types.


Chapter 1

BACKGROUND

The human immune system (HIS) has successfully protected our bodies against

attacks from various harmful pathogens, such as bacteria, viruses, and parasites. It

distinguishes pathogens from self-tissue, and further eliminates these pathogens. This

provides a rich source of inspiration for computer security systems, especially intrusion

detection systems [92]. Hence, applying theoretical immunology and observed immune

functions, its principles, and its models to IDS has gradually developed into a new

research field, called artificial immune system (AIS).

How to apply the remarkable features of the HIS to achieve a scalable and robust IDS is considered a research gap in the field of computer security. In this chapter, we introduce the background knowledge necessary to discuss the algorithms proposed in the following chapters, which can partly fill this gap.

Firstly, a brief introduction to network anomaly detection is presented. We then give an overview of the HIS. Next, immune selection algorithms, detectors, performance metrics and their relevance are reviewed and discussed. Finally, some popular datasets are examined.

1.1 Detection of Network Anomalies

The idea of intrusion detection is predicated on the belief that an intruder’s

behavior is noticeably different from that of a legitimate user and that many unautho-

rized actions are detectable [65]. Intrusion detection systems (IDSs) are deployed as a

second line of defense along with other preventive security mechanisms, such as user


authentication and access control. Based on its deployment, an IDS can act either as

a host-based or as a network-based IDS.

1.1.1 Host-Based IDS

A Host-Based IDS (HIDS) monitors and analyzes the internals of a computing

system. A HIDS may detect internal activity such as which program accesses what

resources and attempts illegitimate access, for example, an activity that modifies the

system password database. Similarly, a HIDS may look at the state of a system and

its stored information whether it is in RAM or in the file system or in log files or

elsewhere. Thus, one can think of a HIDS as an agent that monitors whether anything

or anyone internal or external has circumvented the security policy that the operating

system tries to enforce [12].

1.1.2 Network-Based IDS

A Network-Based IDS (NIDS) detects intrusions in network data. Intrusions

typically occur as anomalous patterns. Most techniques model the data in a sequential

fashion and detect anomalous subsequences. The primary reason for these anomalies

is the attacks launched by outside attackers who want to gain unauthorized access to

the network to steal information or to disrupt the network. In a typical setting, a

network is connected to the rest of the world through the Internet. The NIDS reads

all incoming packets or flows, trying to find suspicious patterns. For example, if a

large number of TCP connection requests to a very large number of different ports are

observed within a short time, one could assume that there is someone committing a

port scan at some of the computers in the network. Port scans mostly try to detect

incoming shell codes in the same manner that an ordinary intrusion detection system

does. In addition to inspecting the incoming traffic, a NIDS also provides valuable

information about intrusion from outgoing or local traffic. Some attacks might even be

staged from the inside of a monitored network or network segment; and therefore, not

regarded as incoming traffic at all. The data available for intrusion detection systems

can be at different levels of granularity, like packet level traces or Cisco netflow data.


The data is high dimensional, typically, with a mix of categorical as well as continuous

numeric attributes. Misuse-based NIDSs attempt to search for known intrusive patterns

while an anomaly-based intrusion detector searches for unusual patterns. Today, the

intrusion detection research is mostly concentrated on anomaly-based network intrusion

detection because it can detect both known and unknown attacks [12].
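The port-scan example above can be made concrete as a simple sliding-window threshold heuristic. The event format, window length and port threshold in the sketch below are illustrative assumptions, not values used by any particular NIDS:

```python
from collections import defaultdict

# Toy port-scan heuristic for the example above: flag a source address
# that contacts many distinct destination ports within a short window.
# Window length and threshold are illustrative, not tuned values.

def detect_port_scans(events, window=10.0, min_ports=20):
    """events: iterable of (timestamp, src_ip, dst_port), sorted by time.
    Returns the set of source IPs flagged as scanners."""
    recent = defaultdict(list)      # src_ip -> [(timestamp, dst_port), ...]
    flagged = set()
    for t, src, port in events:
        recent[src].append((t, port))
        # Drop observations that fell out of the sliding window.
        recent[src] = hist = [(ts, p) for ts, p in recent[src] if t - ts <= window]
        if len({p for _, p in hist}) >= min_ports:
            flagged.add(src)
    return flagged

# A burst of 25 distinct ports from one host within about a second is flagged;
# repeated connections to a single port from another host are not.
scan = [(i * 0.05, '10.0.0.9', 1000 + i) for i in range(25)]
normal = [(i * 1.0, '10.0.0.2', 80) for i in range(25)]
print(detect_port_scans(sorted(scan + normal)))  # {'10.0.0.9'}
```

A production NIDS would of course combine many such heuristics with signature matching and anomaly models; this sketch only illustrates the "large number of connection requests in a short time" observation from the text.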

1.1.3 Methods

On the basis of the availability of prior knowledge, the detection mechanism

used, the mode of performance and the ability to detect attacks, existing anomaly

detection methods are categorized into six broad categories [41] as shown in Fig. 1.1.

This figure is adapted from [12].

Figure 1.1: Classification of anomaly-based intrusion detection methods

AIS is a fairly new research subfield of computational intelligence. It can be considered a system that acts intelligently: what it does is appropriate for its circumstances and its goal; it is flexible to changing environments and changing goals; it learns from experience; and it makes appropriate choices given perceptual limitations and finite computation [68].


1.1.4 Tools

IDS tools are used for purposes such as information gathering, victim identification, packet capture, network traffic analysis and visualization of traffic behavior. Examples of such tools, both commercial and free, include Snort, Suricata, Bro, OSSEC, Samhain, Cisco Secure IDS, CyberCop, and RealSecure. Immune-related IDS tools include LISYS [10], which is based on TCP packets, and MILA [26], a multilevel immune learning algorithm proposed for novel pattern recognition.

However, despite their initially promising and influential properties, immune-

based IDSs never made it beyond the prototype stage [83]. Two main issues that

impeded the progress of immune algorithms were identified: large computational cost

to achieve acceptable coverage of the potentially anomalous region [54], and the failure

of these algorithms to generalize properly beyond the training set [79].

1.2 A brief overview of human immune system

Mainly inspired by the human immune system, researchers have developed AISs intellectually and innovatively. The human immune system is a multi-layered protection architecture comprising physical barriers, physiological barriers, an innate immune system, and an adaptive immune system. Among these, the adaptive immune system, which is capable of adaptively recognizing specific types of pathogens and memorizing them for accelerated future responses, is a complex of a variety of molecules, cells, and organs spread all over the body [46]. Pathogens are foreign substances, such as viruses, parasites and bacteria, which attack the body. Figure 1.2, adapted from [77], presents this multi-layered protection and elimination architecture.

T cells and B cells cooperate to distinguish self from nonself. On the one hand,

T cells recognize antigens with the help of major histocompatibility complex (MHC)

molecules. Antigen presenting cells ingest and fragment antigens to peptides. MHC

molecules transport these peptides to the surface of antigen presenting cells. T cells,

whose receptors bind with these peptide-MHC combinations, are said to recognize


Figure 1.2: Multi-layered protection and elimination architecture

antigens. On the other hand, B cells recognize antigens by binding their receptors

directly to antigens. The bindings actually are chemical bonds between receptors and

epitopes. The more complementary the structure and the charge between receptors and

epitopes are, the more likely binding will occur. The strength of the bond is termed

affinity. To avoid autoimmunity, T cells and B cells must pass a negative selection

stage, where lymphocytes matching self cells are killed.

Prior to negative selection, T cells undergo positive selection. This is because in

order to bind to the peptide-MHC combinations, they must recognize self MHC first.

Thus, the positive selection will eliminate T cells with weak bonds to self MHC. T cells

and B cells, which survive the negative selection, become mature, and enter the blood

stream to perform the detection task. Since these mature lymphocytes have never

encountered antigens, they are naive. Naive T cells and B cells can possibly auto-react

with self cells, because some peripheral self proteins are never presented during the

negative selection stage. To prevent self-attack, naive cells need two signals in order

to be activated: one occurs when they bind to antigens, and the other is from other

sources as a confirmation. Naive T helper cells receive the second signal from innate

system cells. In the event that they are activated, T cells begin to clone. Some of

the clones will send out signals to stimulate macrophages or cytotoxic T cells to kill

antigens, or send out signals to activate B cells. Others will form memory T cells. The

activated B cells migrate to a lymph node. In the lymph node, a B cell will clone itself.


Meanwhile, somatic hypermutation is triggered, whose rate is 10 times higher than that of germline mutation and is inversely proportional to the affinity. Mutation changes the receptor structures of offspring; hence, offspring have to bind to pathogenic epitopes captured within the lymph nodes. If they do not bind, they simply die after a short time; if they succeed in binding, they leave the lymph node and differentiate into plasma or memory B cells.

In summary, the HIS is a distributed, self-organizing and lightweight defense

system for the body. These remarkable features fulfill and benefit the design goals of

an intrusion detection system, thus resulting in a scalable and robust system [53].

1.3 AIS for IDS

1.3.1 AIS model for IDS

Figure 1.3 illustrates the steps necessary to obtain an AIS solution for a security problem, as first envisioned by de Castro and Timmis [27] and later adopted by Fernandes et al. [35]. Firstly, the security domain of the system to be modeled needs to be identified. Secondly, the immune entities that best fit the needs of the system should be picked from the immunological theories; this should ease pointing out the representation of the entities. In the affinity-measure step, one should take into account a matching rule that determines whether two elements should bind.

Figure 1.3: Multi-layer AIS model for IDS


1.3.2 AIS features for IDS

According to Kim et al. [55], AIS features can be illustrated and summarized

as follows.

Firstly, a distributed IDS supports robustness, configurability, extendibility and

scalability. It is robust since the failure of one local intrusion detection process does

not cripple the overall IDS. It is also easy to configure a system since each intrusion

detection process can be simply tailored for the local requirements of a specific host.

The addition of new intrusion detection processes running on different operating sys-

tems does not require modification of existing processes and hence it is extensible. It

can also scale better, since the high volume of audit data is distributed amongst many

local hosts and is analyzed by those hosts.

Secondly, a self-organizing IDS provides adaptability and global analysis. With-

out external management or maintenance, a self organizing IDS automatically detects

intrusion signatures which are previously unknown and/or distributed, and eliminates

and/or repairs compromised components. Such a system is highly adaptive because

there is no need for manual updates of its intrusion signatures as network environments

change. Global analysis emerges from the interactions among a large number of varied

intrusion detection processes.

Next, a lightweight IDS supports efficiency and dynamic features. A lightweight

IDS does not impose a large overhead on a system or place a heavy burden on CPU

and I/O. It places minimal work on each component of the IDS. The primary functions

of hosts and networks are not adversely affected by the monitoring. It also dynamically covers intrusion and non-intrusion pattern spaces at any given time rather than maintaining entire intrusion and non-intrusion patterns.

One more important feature is a multi-layered IDS which increases robustness.

The failure of one defensive layer does not necessarily allow an entire system to be

compromised. While a distributed IDS allocates intrusion detection processes across

several hosts, a multi-layered IDS places different levels of sensors at one monitoring

place.

Additionally, a diverse IDS provides robustness. A variety of different intrusion


detection processes spread across hosts will slow an attack that has successfully com-

promised one or more hosts. This is because an understanding of the intrusion process

at one site provides limited or no information on intrusion processes at other sites.

Finally, a disposable IDS increases robustness, extendibility and configurability. It does not depend on any single component. Any component

can be easily and automatically replaced with other components. These properties are

important in an effective IDS, as well as being established properties of the HIS.

1.4 Selection algorithms

The main developments within AIS have focused on three immunological theories: clonal selection, immune networks and negative selection. Negative selection approaches are based on self-nonself discrimination in biological systems. This property makes them attractive to computer and network security researchers. A survey by G. C. Silva and D. Dasgupta [71] showed that in the five-year period 2008-2013, NSAs predominated over all other AIS models in terms of published papers relating to both network security and anomaly detection. This trend motivates much of the research work in this thesis.

A model of AIS, the positive selection algorithm (PSA), is also investigated. Under some conditions, we will prove in a following section that PSA is equivalent to NSA in terms of anomaly detection performance.

1.4.1 Negative Selection Algorithms

Negative selection is a mechanism employed to protect the body against self-

reactive lymphocytes. Such lymphocytes can occur because the building blocks of

antibodies are different gene segments that are randomly composed and undergo a fur-

ther somatic hypermutation process. Therefore, this process can produce lymphocytes

which are able to recognise self-antigens [85].

NSAs are among the most popular and extensively studied techniques in ar-

tificial immune systems that simulate the negative selection process of the biological

immune system. Stephanie Forrest et al. [38] proposed an algorithmic model of this


process, which can be considered as a classifier that learns from only self samples

(negative examples)1.

A typical NSA comprises two phases: detector generation and detection [7,

50]. In the detector generation phase (Fig. 1.4.a), the detector candidates are generated

by some random processes and censored by matching them against given self samples

taken from a set S (representing the system components). The candidates that match

any element of S are eliminated and the rest are kept and stored in the detector

set D. In the detection phase (Fig. 1.4.b), the collection of detectors is used to distinguish self (system components) from nonself (outliers, anomalies, etc.). If an incoming data instance matches any detector, it is claimed as nonself or anomalous. Figure 1.4 is

adapted from [38].

Figure 1.4: Outline of a typical negative selection algorithm.
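The generate-and-censor loop and the detection phase above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's algorithm: the Hamming-threshold matching rule and all identifiers here are assumptions standing in for the r-chunk and rcb rules defined later.

```python
import random

def generate_detectors(self_set, n_detectors, matches, length, rng):
    """Detector generation phase: random candidates that match no self sample survive."""
    detectors = set()
    while len(detectors) < n_detectors:
        cand = "".join(rng.choice("01") for _ in range(length))
        if not any(matches(cand, s) for s in self_set):
            detectors.add(cand)
    return detectors

def is_nonself(x, detectors, matches):
    """Detection phase: an instance matching any detector is claimed nonself."""
    return any(matches(d, x) for d in detectors)

# Illustrative matching rule (an assumption, not the r-chunk/rcb rules):
# two strings match if their Hamming distance is at most 1.
def hamming_match(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b)) <= 1

rng = random.Random(0)
S = {"00000", "00001"}          # toy self set
D = generate_detectors(S, 4, hamming_match, 5, rng)
assert all(not any(hamming_match(d, s) for s in S) for d in D)   # censored
assert not is_nonself("00000", D, hamming_match)                 # self passes
```

By construction, no detector can match a self sample, so the self strings always pass the detection phase unflagged.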

The concept of matching, or recognition, is used both in the detector generation phase and in the anomaly detection phase. Regardless of representation, a matching rule on a detector d and a data sample s can be informally defined as a distance measure between d and s within a threshold. The matching threshold exposes the concept of partial matching: two points do not have to be exactly the same to be considered matching.

1 At the time of writing this thesis, the paper had been cited more than 2300 times.


A partial matching rule can support an approximation or a generalization in the algo-

rithms. The choice of the matching rule or the threshold in a matching rule must be

application specific and representation dependent [51]. For real-valued representation,

some popular rules are Euclidean distance and Manhattan distance. In string representation, the rcb (r-contiguous bits) matching rule and the r-chunk matching rule are the most famous ones; they are formally presented in a following section.

Since its introduction, NSA has had many applications such as in computer virus

detection [37, 5], monitoring UNIX processes [36], anomaly detection [22, 26], intrusion

detection [19, 54, 46, 59, 18, 93], scheduling [64], fault detection and diagnosis [45, 72],

negative database [33, 98], and negative authentication [25, 20]. Moreover, NSAs have been quite successfully applied in immunology, where they are used as models to provide insight into fundamental principles of immunity and infection [15], and to illustrate

the immunological processes such as HIV infection [56, 57].

The most significant characteristics of an NSA, which make it unique and strong, are:

• No prior knowledge of nonself is required [29].

• It is inherently distributable; no communication between detectors is needed [30].

• It can hide the self concept [33].

• Compared with other change detection methods, NSAs do not depend on prior knowledge of what is defined as normal. Consequently, the checking activity at each site can be based on that site's unique signature while the same algorithm is used over multiple sites.

• The quality of the check can be traded off against the cost of performing a check

[38].

• Symmetric protection is provided, so malicious manipulation of the detector set can be detected from the normal behavior of the system [38].

• If the process of generating detectors is costly, it can be distributed to multiple

sites because of its inherent parallel characteristics.


• Detection is tunable to balance between coverage (matching probability) and the

number of detectors [29].

1.4.2 Positive Selection Algorithms

Contrary to NSAs, PSAs have been less studied in the literature. PSAs are

mainly developed and applied in intrusion detection [23, 73, 44, 66], malware detec-

tion [39], spam detection [81], and classification [40, 67]. Stibor et al. [80] argue that positive selection might have better detection performance than negative selection. However, for problems and applications where the number of detectors generated by NSAs is much smaller than the number of self samples, negative selection is obviously a better choice [51].

Similar to NSA, a PSA contains two phases: detector generation and detection.

In the detector generation phase (Fig. 1.5.a), the detector candidates are generated by

some random processes and matched against the given self sample set S. The candi-

dates that do not match any element in S are eliminated and the rest are kept and

stored in the detector set D. In the detection phase (Fig. 1.5.b), the collection of detectors is used to distinguish self from nonself. If an incoming data instance matches any detector, it is claimed as self. In other words, detector modeling involves generating a

Figure 1.5: Outline of a typical positive selection algorithm.


set of strings (patterns) that do not match any string in a training dataset too strongly

(negative selection) or weakly match at least one string from the same dataset (positive

selection). Having obtained the detectors, one usually examines a testing dataset (i.e., "antigens"), searching for one or all matching detectors for classification.

1.5 Basic terms and definitions

In selection algorithms, an essential component is the matching rule which de-

termines the similarity between detectors and self samples (in the detector generation

phase) and incoming data instances (in the detection phase). Obviously, the matching rule depends on the detector representation. In this thesis, both self and nonself cells are represented as strings of fixed length. This is a simple and popular representation for detectors and data in AIS, and other representations (such as real-valued) can be reduced to binary strings, a special case of strings [42, 51].

1.5.1 Strings, substrings and languages

An alphabet Σ is a nonempty and finite set of symbols. A string s ∈ Σ∗ is a sequence of symbols from Σ, and its length is denoted by |s|. A string is called the empty string if its length equals 0. Given an index i ∈ {1, . . . , |s|}, s[i] is the symbol at position i in s. Given two indices i and j, whenever j ≥ i, s[i . . . j] is the substring of s with length j − i + 1 that starts at position i, and whenever j < i, s[i . . . j] is the empty string. If i = 1, then s[i . . . j] is a prefix of s and, if j = |s|, then s[i . . . j] is a suffix of s. For a proper prefix or suffix s′ of s, we have in addition |s′| < |s|. Given a string s ∈ Σℓ, another string d ∈ Σr with 1 ≤ r ≤ ℓ, and an index i ∈ {1, . . . , ℓ − r + 1}, we say that d occurs in s at position i if s[i . . . i + r − 1] = d. Moreover, the concatenation of two strings s and s′ is written s + s′.

A set of strings S ⊆ Σ∗ is called a language. For two indices i and j, we define

S[i . . . j] = {s[i . . . j]|s ∈ S}.


1.5.2 Prefix trees, prefix DAGs and automata

A prefix tree T is a rooted directed tree with edge labels from Σ where for all

c ∈ Σ, every node has at most one outgoing edge labeled with c. For a string s, we write

s ∈ T if there is a path from the root of T to a leaf such that s is the concatenation

of the labels on this path. The language L(T ) described by T is defined as the set of

all strings that have a nonempty prefix s ∈ T . For example, for T as in Fig. 1.6.a we have 0 ∈ T and 10 ∈ T , but 1 ∉ T . Furthermore, 0 ∈ L(T ) and 01 ∈ L(T ) since 0 ∈ T , and 11 ∉ L(T ) since no prefix of 11 lies in T . Trees for self datasets and nonself datasets are called positive trees and negative trees, respectively.
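The membership tests s ∈ T and s ∈ L(T) can be sketched with a nested-dictionary trie. This is a hypothetical helper, assuming the tree of Fig. 1.6.a stores exactly the strings 0 and 10, consistent with the memberships stated above.

```python
class PrefixTree:
    def __init__(self, strings):
        self.root = {}
        for s in strings:
            node = self.root
            for c in s:
                node = node.setdefault(c, {})

    def contains(self, s):            # s ∈ T: path from root to a leaf labeled s
        node = self.root
        for c in s:
            if c not in node:
                return False
            node = node[c]
        return not node               # the path must end at a leaf

    def in_language(self, s):         # s ∈ L(T): some nonempty prefix of s lies in T
        return any(self.contains(s[:i]) for i in range(1, len(s) + 1))

# The tree of Fig. 1.6.a stores the strings 0 and 10.
T = PrefixTree(["0", "10"])
assert T.contains("0") and T.contains("10") and not T.contains("1")
assert T.in_language("0") and T.in_language("01")
assert not T.in_language("11")
```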

A prefix DAG D is a directed acyclic graph with edge labels from Σ, where

again for all c ∈ Σ, every node has at most one outgoing edge labeled with c. Similar to prefix trees, the terms root and leaf are used to refer to nodes without incoming and outgoing edges, respectively. We write s ∈ D if there is a path from a root node to a leaf node in D that is labeled by s. Given n ∈ D, the language L(D, n) contains all strings that have a nonempty prefix that labels a path from n to some leaf. For instance, if D is the DAG in Fig. 1.6.b and n is its lower left node, then L(D, n) consists of all strings starting with 11. Moreover, we define L(D) = ∪{L(D, n) | n is a root of D}.

A finite automaton is a tuple M = (Q, q0, Qa, Σ, ∆), where Q is a set of states with a distinguished initial state q0 ∈ Q, Qa ⊆ Q is the set of accepting states, Σ is the alphabet of M , and ∆ ⊆ Q × Σ × Q is the transition relation. Furthermore, we assume that the transition relation is unambiguous: for every q ∈ Q and every c ∈ Σ there is at most one q′ ∈ Q with (q, c, q′) ∈ ∆. It is common to represent the transition relation as a graph with node set Q (with the initial state and the accepting states highlighted properly) and labeled edges (a c-labeled edge from q to q′ if (q, c, q′) ∈ ∆). An automaton M is said to accept a string s if its graph contains a path from q0 to some q ∈ Qa whose concatenated edge labels equal s (note that this path may contain loops). The language L(M ) contains all strings accepted by M .
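Acceptance under an unambiguous transition relation reduces to following at most one edge per symbol. The sketch below uses a toy automaton chosen for illustration; its language, the strings starting with 11, mirrors the prefix-DAG example, and all names are assumptions.

```python
def accepts(delta, q0, accepting, s):
    """Follow the unambiguous transition relation; reject on a missing transition."""
    q = q0
    for c in s:
        if (q, c) not in delta:
            return False
        q = delta[(q, c)]
    return q in accepting

# A toy automaton over Σ = {0, 1} accepting exactly the strings that start
# with 11; the self-loops on the accepting state let accepted paths contain loops.
delta = {("q0", "1"): "q1", ("q1", "1"): "qa",
         ("qa", "0"): "qa", ("qa", "1"): "qa"}
assert accepts(delta, "q0", {"qa"}, "11")
assert accepts(delta, "q0", {"qa"}, "1101")
assert not accepts(delta, "q0", {"qa"}, "10")
```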

A prefix DAG can be turned into a finite automaton to decide the membership of strings in languages. The detailed steps of this process are presented in Chapter 4.


Figure 1.6: Example of a prefix tree and a prefix DAG.

1.5.3 Detectors

In PSAs and NSAs, an essential component is the matching rule, which determines the similarity between detectors and self samples (in the detector generation phase) and incoming data instances (in the detection phase). Obviously, the matching rule depends on the detector representation. For string-based AIS, the r-chunk and r-contiguous detectors are among the most common matching rules. An r-chunk matching rule can be seen as a generalisation of the r-contiguous matching rule, which helps

AIS to achieve better results on data where adjacent regions of the input data sequence

are not necessarily semantically correlated, such as in network data packets [9].

An important difference between the rcb and r-chunk matching rules is the holes, or undetectable strings, that they may induce. This concept is presented in Section 1.5.5.

Given a nonempty and finite alphabet Σ, positive and negative r-chunk detectors, r-contiguous detectors, and rcbvl detectors can be defined as follows:

Definition 1.1 (Positive r-chunk detectors). Given a self set S ⊆ Σℓ, a tuple (d, i) of a string d ∈ Σr, where r ≤ ℓ, and an index i ∈ {1, ..., ℓ − r + 1} is a positive r-chunk detector if there exists an s ∈ S such that d occurs in s at position i.

Definition 1.2 (Negative r-chunk detectors). Given a self set S ⊆ Σℓ, a tuple (d, i) of a string d ∈ Σr, r ≤ ℓ, and an index i ∈ {1, ..., ℓ − r + 1} is a negative r-chunk detector if d does not occur in any s ∈ S at position i.

Although the approaches proposed in the following chapters can be implemented over any finite alphabet, all strings in the examples are binary, Σ = {0, 1}, for ease of understanding.

Example 1.1. Let ℓ = 6, r = 3. Given a set S of five self strings: s1 = 010101, s2 = 111010, s3 = 101101, s4 = 100011, s5 = 010111. The set of some positive r-chunk detectors is {(010,1), (111,1), (101,2), (110,2), (010,3), (101,3), (101,4), (010,4), (111,4)}. The set of some negative r-chunk detectors is {(000,1), (001,1), (011,1), (001,2), (010,2), (100,2), (000,3), (100,3), (000,4), (001,4), (100,4)}.
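The per-position detector sets of Example 1.1 can be recomputed with a short sketch (the function name is hypothetical); since the example lists only some detectors, the assertions check containment alongside one full set.

```python
from itertools import product

def chunk_detectors(S, r, alphabet="01"):
    """Split Σ^r at each position into positive (occurs in some s ∈ S)
    and negative (occurs in no s ∈ S) r-chunk detectors."""
    l = len(next(iter(S)))
    Dp, Dn = {}, {}
    for i in range(l - r + 1):                      # position i+1 in 1-based notation
        seen = {s[i:i + r] for s in S}
        Dp[i + 1] = seen
        Dn[i + 1] = {"".join(t) for t in product(alphabet, repeat=r)} - seen
    return Dp, Dn

S = {"010101", "111010", "101101", "100011", "010111"}   # Example 1.1
Dp, Dn = chunk_detectors(S, 3)
assert Dp[1] == {"010", "111", "101", "100"}             # all positive chunks at position 1
assert {"000", "001", "011"} <= Dn[1]                    # the listed negative chunks
```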

Definition 1.3. Given a self set S ⊆ Σℓ, a string d ∈ Σℓ is an r-contiguous detector if d[i . . . i + r − 1] does not occur in any s ∈ S at position i, for all i ∈ {1, ..., ℓ − r + 1}.

Example 1.2. Let ℓ = 5, r = 3. Given a set of 7 self strings S = {01111, 00111, 10000, 10001, 10010, 10110, 11111}. The set of all 3-contiguous detectors is {01011, 11011}. This example is adapted from [32].
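A brute-force sketch over Σℓ reproduces the detector set of Example 1.2, using positional window matching as in Definition 1.3 (the function name is hypothetical).

```python
from itertools import product

def cont_detectors(S, r):
    """All d ∈ Σ^ℓ whose every length-r window avoids the corresponding
    windows of every s ∈ S."""
    l = len(next(iter(S)))
    self_windows = [{s[i:i + r] for s in S} for i in range(l - r + 1)]
    out = set()
    for bits in product("01", repeat=l):
        d = "".join(bits)
        if all(d[i:i + r] not in self_windows[i] for i in range(l - r + 1)):
            out.add(d)
    return out

S = {"01111", "00111", "10000", "10001", "10010", "10110", "11111"}  # Example 1.2
assert cont_detectors(S, 3) == {"01011", "11011"}
```

Exhaustive enumeration is only feasible for such toy universes; it is used here purely to confirm the example.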

We also use the following notations:

• Dpi = {(d, i) | (d, i) is a positive r-chunk detector} is the set of all positive r-chunk detectors at position i, i = 1, . . . , ℓ − r + 1.

• Dni = {(d, i) | (d, i) is a negative r-chunk detector} is the set of all negative r-chunk detectors at position i, i = 1, . . . , ℓ − r + 1.

• CHUNKp(S, r) = ∪_{i=1}^{ℓ−r+1} Dpi is the set of all positive r-chunk detectors.

• CHUNK(S, r) = ∪_{i=1}^{ℓ−r+1} Dni is the set of all negative r-chunk detectors.

• CONT(S, r) is the set of all r-contiguous detectors that do not match any string

in S.

• For a given detector set X, L(X) is the set of all nonself strings detected by X. We also say that Σℓ \ L(X) is the set of all strings claimed as self by X.

Example 1.3. Let ℓ = 5, matching threshold r = 3. Suppose that we have the set S

of six self strings s1 = 00000, s2 = 00010, s3 = 10110, s4 = 10111, s5 = 11000, s6 =

11010. Dp1 = {(000,1), (101,1), (110,1)} (Dp1 is the set of all leftmost substrings of length r of all s ∈ S), Dn1 = {(001,1), (010,1), (011,1), (100,1), (111,1)}, Dp2 =

{(000,2), (001,2), (011,2), (100,2), (101,2)}, Dn2 = {(010,2), (110,2), (111,2)}, Dp3

= {(000,3), (010,3), (110,3), (111,3)}, Dn3 = {(001,3), (011,3), (100,3), (101,3)}

(note that Dpi ∪ Dni = Σ3, i = 1, 2, 3).


The self space covered by the set of CHUNKp(S, 3) is {0, 1}5\ L(CHUNKp(S, 3))

= {00000, 00001, 00010, 00011, 00110, 00111, 01000, 01001, 01010, 01011, 01110,

01111, 10000, 10001, 10010, 10011, 10100, 10101, 10110, 10111, 11000, 11001, 11010,

11011, 11110, 11111}. The set of all strings detected by CHUNK(S,3) is L(CHUNK(S,3))

= {00001, 00011, 00100, 00101, 00110, 00111, 01000, 01001, 01010, 01011, 01100,

01101, 01110, 01111, 10000, 10001, 10010, 10011, 10100, 10101, 11001, 11011, 11100,

11101, 11110, 11111}.

Definition 1.4. Given a self set S ⊆ Σℓ, a triple (d, i, j) of a string d ∈ Σk, 1 ≤ k ≤ ℓ, an index i ∈ {1, ..., ℓ − r + 1} and an index j ∈ {i, ..., ℓ − r + 1} is called a negative detector under the rcbvl matching rule if d does not occur in any s ∈ S.

In other words, a triple (d, i, j) is an rcbvl detector if there exist (j − i + 1) negative r-chunk detectors (d1, i), ..., (dj−i+1, j) such that dk and dk+1 are (r − 1)-symbol overlapping strings, k = 1, ..., j − i.

Example 1.4. Given ℓ, r and the set S of self strings as in Example 1.1, S = {010101, 111010, 101101, 100011, 010111}. The triple (0001,1,2) is an rcbvl detector because there exist two 3-chunk detectors (000,1) and (001,2) such that 000 and 001 are 2-bit overlapping strings. A detector set under the rcbvl matching rule contains 5 variable-length detectors {(0001,1,2), (00100,1,4), (100,4,4), (011110,1,4), (11000,1,3)}. It is a minimum detector set (23 bits) that covers the whole detection space of the r-chunk detector set in Example 1.1 (45 bits).
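The chunk reading of Definition 1.4 can be checked mechanically; the sketch below (a hypothetical helper) verifies a triple by testing each of its r-windows against the self strings at the corresponding position.

```python
def is_rcbvl_detector(d, i, j, S, r):
    """(d, i, j) decomposes into j-i+1 overlapping negative r-chunk detectors
    (d[m..m+r-1], i+m-1) -- a sketch of Definition 1.4 via its chunk reading."""
    if len(d) != r + (j - i):
        return False
    for m in range(j - i + 1):                 # m-th chunk sits at 1-based position i+m
        chunk, pos = d[m:m + r], i + m - 1     # pos is the 0-based position
        if any(s[pos:pos + r] == chunk for s in S):
            return False                       # chunk occurs in a self string
    return True

S = {"010101", "111010", "101101", "100011", "010111"}   # Examples 1.1 and 1.4
assert is_rcbvl_detector("0001", 1, 2, S, 3)             # chunks (000,1) and (001,2)
assert not is_rcbvl_detector("010", 1, 1, S, 3)          # 010 occurs in s1 at position 1
```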

The matching threshold r plays an important role in selection algorithms. The value of r can be used to balance between underfitting and overfitting. Our proposed

methods in Chapter 5 investigate this value in combination with simple statistics for

better detection performance.

1.5.4 Detection in r-chunk detector-based positive selection

It can be seen from Example 1.3 that L(CHUNKp(S, r)) = {00100, 00101, 01100, 01101, 11100, 11101} ≠ L(CHUNK(S, r)), so the detection coverage of Dn is not the same as that of Dp. This is undesirable for the combination of PSA and


NSA. Hence, to combine PSA and NSA in a unified framework, we have to change the original semantics of positive selection in the detection phase as follows.

Definition 1.5 (Detection in positive selection). If a new instance matches ℓ − r + 1 positive r-chunk detectors (di, i), i = 1, . . . , ℓ − r + 1, it is claimed as self; otherwise it is claimed as nonself.

With this new detection semantics, the following theorem on the equivalence of the detection coverage of r-chunk-type PSA and NSA can be stated.

Theorem 1.1 (Detection Coverage). The detection coverages of the positive and negative selection algorithms coincide:

L(CHUNKp(S, r)) = L(CHUNK(S, r))    (1.1)

Proof. From the description of NSAs (see Fig. 1.4), if a new data instance matches

a negative r-chunk detector, then it is claimed as nonself, otherwise it is claimed as

self. Obviously, it is dual to the detection of new data instances in positive selection

as given in Definition 1.5.

This theorem lays the foundation for our novel Positive-Negative Selection Algorithm (PNSA) proposed in the next chapter.
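Theorem 1.1 can be checked exhaustively on the small universe of Example 1.3; the sketch below (with hypothetical names) compares Definition 1.5's positive semantics against negative detection on all 32 strings.

```python
from itertools import product

S = {"00000", "00010", "10110", "10111", "11000", "11010"}   # Example 1.3
l, r = 5, 3
Dp = [{s[i:i + r] for s in S} for i in range(l - r + 1)]     # positive chunks per position
Dn = [{"".join(t) for t in product("01", repeat=r)} - p for p in Dp]

def self_by_psa(x):
    # Definition 1.5: self iff every window matches a positive detector
    return all(x[i:i + r] in Dp[i] for i in range(l - r + 1))

def nonself_by_nsa(x):
    # NSA detection: nonself iff some window matches a negative detector
    return any(x[i:i + r] in Dn[i] for i in range(l - r + 1))

# Theorem 1.1: the two detection coverages coincide over the whole universe.
for bits in product("01", repeat=l):
    x = "".join(bits)
    assert self_by_psa(x) == (not nonself_by_nsa(x))
```

The equivalence follows directly from Dn being the positionwise complement of Dp, which is exactly the duality used in the proof.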

1.5.5 Holes

The generalization performed by selection algorithms corresponds to the strings

that are labeled self even though they do not occur in the self sample S. These strings

are called holes. There are two types of holes in the rcb matching rule literature: crossover holes and length-limit holes. A crossover hole is a crossover of certain self strings, in

which all substrings of length r occur in self strings. A length-limit hole is one that has

at least one substring of length r that exists in CHUNK(S, r). The r-chunk matching

rule eliminates the problem of length-limit holes.

Fig. 1.7 illustrates the existence of holes in a space comprising self and detector strings. The string universe Σℓ is drawn as a square region. Each dark circle represents a detector and the grid-shaped region in the middle is self. The universe is


Figure 1.7: Existence of holes.

classified by the detector set as self (grid region and holes - white region) and nonself

(dark region covered by all circles).

In fact, holes are not a "problem", but as pointed out by Stibor et al. [78], they

are a necessary property of the selection algorithms. Without holes, the algorithms

would do nothing but memorize the training data for classification naively.

Fig. 1.8 shows a set of seven self strings as in Example 1.2, S ⊂ {0, 1}5 (left)

along with CHUNK(S, 3) (middle) and CONT(S, 3) (right). For both detector types,

the induced bipartitionings of the shape space {0, 1}5 are illustrated with strings that

are classified as nonself having a gray background and strings that are classified as self

having a white background. Bold strings are members of the self-set. Holes are the

strings that are classified as self but do not occur in the self-set S (non-bold, non-shaded

strings). This figure is adapted from [32].

1.5.6 Performance metrics

We used three metrics to evaluate each machine learning technique. Detection

rate (DR), Accuracy rate (ACC) and False alarm rate (FAR) are defined as:

DR = TP / (TP + FN)    (1.2)

ACC = (TP + TN) / (TP + TN + FP + FN)    (1.3)

FAR = FP / (FP + TN)    (1.4)


Figure 1.8: Negative selections with 3-chunk and 3-contiguous detectors.

where TP (True Positive) is the number of true positives (correctly classified as nonself), TN (True Negative) is the number of true negatives (correctly classified as self), FP (False Positive) is the number of false positives (classified as nonself, actually self), and FN (False Negative) is the number of false negatives (classified as self, actually nonself).
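The three metrics reduce to a few ratios over the confusion counts; the sketch below uses hypothetical counts chosen only to exercise the formulas.

```python
def metrics(tp, tn, fp, fn):
    """Detection rate, accuracy and false alarm rate from the confusion counts."""
    dr = tp / (tp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    far = fp / (fp + tn)
    return dr, acc, far

# Hypothetical confusion counts, not taken from any experiment in this thesis.
dr, acc, far = metrics(tp=80, tn=90, fp=10, fn=20)
assert dr == 0.8 and acc == 0.85 and far == 0.1
```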

We used the 10-fold cross-validation technique and the holdout technique to evaluate our approaches in experiments. Regarding the former, the dataset was randomly partitioned into 10 subsets. Of the 10 subsets, a single subset was retained for testing, and the others were used as training data. The process was then repeated 10 times, with each of the 10 subsets used exactly once as the testing data. The 10 results from the folds were then averaged to produce a single performance estimate. Regarding the latter, the dataset is split into two groups: a training set used to train the classifier and a test set (or holdout set) used to estimate the performance of the classifier.
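The 10-fold protocol can be sketched as a one-shot shuffle-and-slice of the instance indices (a simplified stand-in for the actual experimental harness; all names are hypothetical).

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle the indices once, then cut them into k near-equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(100, 10)
assert len(folds) == 10
# Each instance lands in exactly one test fold, so it is tested exactly once.
assert sorted(i for fold in folds for i in fold) == list(range(100))
```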

1.5.7 Ring representation of data

As is well known, most AIS-based applications use two types of data representation: strings and real-valued vectors. For both popular types, the representations are linear structures of symbols or numbers. They may omit information at the edges (the beginning and the end) of these structures. For example, to detect a new instance s as self or nonself in the detection phases of PSA and NSA, it takes ℓ − r + 1 steps in the worst case to match s against each detector at positions 1, ..., ℓ − r + 1. Therefore, for a given detector d, the symbols d[1] and s[1] are used in only one match, the symbols d[2] and s[2] are used in two matches, etc. In other words, the positions in linear structures are not equal in terms of matching counts.

Our earlier experimental implementation of NSA on binary ring-based strings motivated us to address this problem. A set of 50,000 random self strings and a set of 10,000 random nonself strings, all of length 50, were used in the experiment. Table 1.1 shows the experimental results for values of r ranging from 10 to 16 under the 10-fold cross-validation technique. The results show that both the detection rate and the accuracy rate of the ring-based NSA are higher than those of the linear-based one, while the false alarm rates are relatively similar. Accordingly, we could use ring structures instead of linear ones for more exact classification.

Table 1.1: Performance comparison of NSAs on linear strings and ring strings.

r     NSA on linear strings          NSA on ring strings
      DR      FAR     ACC            DR      FAR     ACC
10    0.0102  0.0008  0.8343        0.0123  0.0010  0.8345
11    0.0535  0.0051  0.8380        0.0655  0.0063  0.8390
12    0.1817  0.0177  0.8488        0.2193  0.0212  0.8522
13    0.4054  0.0399  0.8677        0.4704  0.0465  0.8723
14    0.6456  0.0640  0.8875        0.7184  0.0719  0.8932
15    0.8293  0.0829  0.9023        0.8888  0.0888  0.9075
16    0.9340  0.0934  0.9112        0.9662  0.0964  0.9139

With reference to string-based detector sets, a simple technique for this approach is to concatenate each string representing a detector with its first k symbols. Each new linear string is a ring representation of its original binary one. Fig. 1.9 shows a ring representation (b) of an original string (a) with k = 3.

Given a set of strings S ⊂ Σℓ, the set Sr ⊂ Σℓ+r−1 includes the ring representations of all strings in S, obtained by concatenating each string s ∈ S with its first r − 1 symbols.
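Building Sr is a one-line transformation; the sketch below (function name hypothetical) reproduces the ring strings of Example 1.5, where r = 3 and hence r − 1 = 2 symbols are appended.

```python
def ring(S, r):
    """Concatenate each string with its first r-1 symbols (ring representation)."""
    return {s + s[:r - 1] for s in S}

# Example 1.5 with r = 3: each ring string carries 2 extra trailing symbols
# copied from its own beginning.
S = {"00000", "10110", "10111", "11111"}
assert ring(S, 3) == {"0000000", "1011010", "1011110", "1111111"}
```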

Figure 1.9: A simple ring-based representation (b) of a string (a).

Note that we can easily apply the idea of ring strings to other data representations in AIS. One way to do this, for instance, is to create ring representations of other

structures, such as trees and automata, from the set Sr instead of S as usual.

1.5.8 Frequency trees

Given a set D of equal-length strings, a tree T on D, denoted TD, is a rooted directed tree with edge labels from Σ where, for all c ∈ Σ, every node has at most one outgoing edge labeled with c. For a string s, we write s ∈ T if there is a path from the root of T to a leaf such that s is the concatenation of the labels on this path. Each leaf is associated with an integer, the frequency of the string s ∈ D that is the concatenation of the labels on the path ending at this leaf. This tree structure is a compact representation of r-chunk detectors in our algorithm in Chapter 5.

Example 1.5. Let ℓ = 5, matching threshold r = 3. Suppose that we have the set S of four strings: s1 = 00000, s2 = 10110, s3 = 10111, s4 = 11111. Sr =

{0000000, 1011010, 1011110, 1111111}. S1 = {(000,1), (101,1), (111,1)}, S2 =

{(000,2), (011,2), (111,2)}, S3 = {(000,3), (110,3), (111,3)}, S4= {(000,4), (101,4),

(111,4)}, S5 = {(000,5), (010,5), (110,5), (111,5)}.

Assume that S = N ∪ A, where the set of normal data is N = {s1, s2} and the set of abnormal data is A = {s3, s4}.

Nr = {0000000, 1011010} (ring representations of all strings in N ). N1 =

{(000,1), (101,1)}, N2 = {(000,2), (011,2)}, N3 = {(000,3), (110,3)}, N4= {(000,4),

(101,4)}, N5 = {(000,5), (010,5)}.

Ar = {1011110, 1111111}. A1 = {(101,1), (111,1)}, A2 = {(011,2), (111,2)},

A3 = {(111,3)}, A4= {(111,4)}, A5 = {(110,5), (111,5)}.

Ten trees representing all 3-chunk detectors are shown in Fig. 1.10. The five trees TNi (TAi), i = 1, . . . , 5, are in the first (second) row, from left to right, respectively.

Figure 1.10: Frequency trees for all 3-chunk detectors.

Recall that some strings belong to both positive trees and negative trees. For example, the substring s2[1 . . . 3] = 101 satisfies s2[1 . . . 3] ∈ TN1 and s2[1 . . . 3] ∈ TA1. This situation could lead to errors in the detection phase. Therefore, frequencies of matches will be used to improve the detection performance of the algorithms. The detailed technique of using these frequencies is presented in Chapter 5.
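The leaf frequencies of the trees in Fig. 1.10 can be tabulated per position; the sketch below uses flat counters as a stand-in for the tree structure (the function name is hypothetical) and checks the shared chunk 101 discussed above.

```python
from collections import Counter

def chunk_frequencies(ring_strings, r):
    """For each position, count how often each r-chunk occurs -- the leaf
    frequencies of the frequency trees, with flat dicts standing in for trees."""
    l = len(next(iter(ring_strings)))
    return [Counter(s[i:i + r] for s in ring_strings) for i in range(l - r + 1)]

Nr = {"0000000", "1011010"}          # ring forms of the normal strings in Example 1.5
Ar = {"1011110", "1111111"}          # ring forms of the abnormal strings
fN, fA = chunk_frequencies(Nr, 3), chunk_frequencies(Ar, 3)
# The chunk 101 at position 1 occurs in both a normal and an abnormal string,
# so frequencies, not mere membership, must arbitrate in the detection phase.
assert fN[0]["101"] == 1 and fA[0]["101"] == 1
```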

1.6 Datasets

There are two basic kinds of NIDS, depending on the source of data to be analyzed: packet-based NIDSs and flow-based ones. M. H. Bhuyan et al. reviewed in [13] that most NIDSs are based on packet data. However, we concentrate only on flow-based NIDSs, for three reasons: (1) they can detect some special attacks, such as DDoS or RDP brute force, more efficiently and faster than payload-based ones, since less information needs to be analyzed [87, 76]; (2) flow-based anomaly detection methods process only packet headers and therefore reduce the data amount and processing time, allowing high-speed detection on large networks; this addresses the scalability problem under increasing network usage and load [76]; (3) flow-based NIDSs raise fewer privacy issues than packet-based ones because of the absence of payload [76].


1.6.1 The DARPA-Lincoln datasets

To evaluate the performance of different intrusion detection methodologies, MIT's Lincoln Laboratory, with sponsorship from the DARPA ITO and the Air Force Research Laboratory, gathered the DARPA-Lincoln datasets, which consist of nine weeks of data collected in 1998: seven weeks for training and the two remaining weeks for test data. The datasets are collected and stored in the form of Tcpdump; they are the data source from which datasets such as KDD99 [3] and NetFlow [86] were extracted.

There were more than 300 instances of 38 different attacks launched against victim UNIX hosts during the collection period, each falling into one of four categories: Denial of Service (DoS), Probe, User to Root (U2R), and Remote to Local (R2L).

For each week, inside and outside network traffic data, audit data recorded by the Basic Security Module on Solaris hosts, and file system dumps from UNIX hosts were collected. In 1999, another series of datasets containing three weeks of training data and two weeks of test data was collected. More than 200 instances of 58 attack types were launched against victim UNIX and Windows NT hosts and a Cisco router. In 2000,

there were three additional scenario-specific datasets generated to address distributed

DoS and Windows NT attacks. Detailed descriptions of these datasets can be found

at [1].

1.6.2 UT dataset

A public labeled flow-based dataset is provided in [75]. This dataset was captured by monitoring a honeypot hosted in the University of Twente network, so we call it the UT dataset. The dataset has three categories: malicious traffic, unknown traffic and side-effect traffic. It has 14,170,132 flows, which are mostly of a malicious nature. Only a small number of flows, 5,968 flows or 0.042%, are not labeled (unknown) and are considered as normal data in our experiments in the following chapters. Each flow in the

considered as normal data in our experiments in following chapters. Each flow in the

dataset has 13 fields: id: the ID of the flow, src ip: anonymized source IP address

(encoded as 32-bit number), dst ip: anonymized destination IP address (encoded as

32-bit number), packets: number of packets in flow, octets: number of bytes in flow,

start time: UNIX start time (number of seconds), start msec: start time (milliseconds

28

part), end time: UNIX end time (number of seconds), end msec: end time (millisec-

onds part), src port: source port number, dst port: destination port number, tcp flags:

TCP flags of the flow, and prot: IP protocol number.

Examples of a normal flow and an attack flow are (393, 3145344965, 2463760020,

3, 168, 1222173606, 974, 1222173610, 239, 0, 769, 0, 1) and (1, 2463760020, 3752951033,

1, 60, 1222173605, 985, 1222173605, 985, 4534, 22, 2, 6), respectively. The ID of a flow

is used to distinguish attack flows from the others.

1.6.3 Netflow dataset

The packet-based DARPA dataset [1] was used to generate a flow-based DARPA dataset, called NetFlow, in [86]. This dataset focuses only on flows to a specific port and an IP address which receives the highest number of attacks. It includes all 129,571 flows (including attacks) to and from the victims. Each flow in the dataset has 10 fields: Source

IP, Destination IP, Source Port, Destination Port, Packets, Octets, Start Time, End

Time, Flags, and Proto. All 24,538 attack flows are labeled with text labels, such as

neptune, portsweep, ftpwrite, etc.

Examples of a normal flow and an attack flow are (172.16.112.20, 172.16.112.50, 53, 32961, 1, 161, 1999-03-05T08:17:10, 1999-03-05T08:17:10, 17, 00) and (209.167.99.71, 172.16.112.50, 10353, 4288, 1, 46, 1999-03-12T17:23:19, 1999-03-12T17:23:19, 6, 02:::portsweep), respectively.

1.6.4 Discussions

This thesis does not address NIDS techniques or algorithms across a wide variety of datasets. Although there exist many other well-known datasets for IDS, such as the KDD99 datasets [3], the NSL-KDD dataset [4], and the FDFA datasets [2], they are all out of the scope of this NetFlow-related thesis.

There have been a number of different viewpoints and studies that criticize the DARPA datasets. McHugh [62] offered one of the most important assessments of the DARPA datasets, which was deeply critical. This assessment gives examples, such as unrealistic data rates for normal and attack data, the lack of training datasets for anomaly detection for its intended purpose, and the absence of validation efforts, to show that some evaluation methodologies are questionable and may have biased the results. Furthermore, the findings of Mahoney and Chan [60], which can be seen as a confirmation of McHugh's observations, revealed that many attributes had small and fixed ranges in the simulation, but large and growing ranges in real traffic.

Despite these criticisms, the DARPA-Lincoln dataset plays a vital role in publicity and remains the most sophisticated benchmark for researchers evaluating intrusion detection or machine learning algorithms [92]. An important role of this data is to serve as a proxy for developing, testing and evaluating detection algorithms, rather than as a solid dataset for a real-time system. If a detection algorithm performs well on the DARPA data, it is more likely to perform well in a real network environment [88].

1.7 Summary

This chapter presents the background topics used in this thesis, including HIS, IDS, and AIS. Some terms and definitions that will be used throughout the thesis are also stated clearly. Two important data structures, ring-based and frequency-based, can be used to improve the classification rate of selection algorithms. Some popular performance metrics and three well-known datasets for NIDS are presented and discussed. These datasets will be used for experiments in the other chapters. Besides, under the new semantics of detection in the r-chunk based PSA (Section 1.5.4), we proved an important theorem on the coincidence of the detection coverage of PSA and that of NSA. This theorem will lead to one contribution of the thesis that will be discussed in the next chapter.


Chapter 2

COMBINATION OF NEGATIVE SELECTION AND POSITIVE SELECTION

It can be seen from Theorem 1.1 in Chapter 1 that the r-chunk based PSA and NSA are dual in terms of detection. This motivates our approach of combining the two selection algorithms in a way that does not affect the detection performance of either.

2.1 Introduction

NSA and PSA are computational models inspired by the negative and positive selection processes of the biological immune system. Of the two, NSA has been studied more extensively, resulting in more variants and applications [51]. However, all existing string-based NSAs have worst-case exponential memory complexity for storing the detectors set, which limits their practical applicability [31]. In this chapter, we introduce a novel selection algorithm that employs binary representation and the r-chunk matching rule for detectors. The new algorithm combines negative and positive selection to reduce both detector storage complexity and detection time, while maintaining the same detection coverage as that of NSAs (PSAs).

In the following section, we review some related works. Section 2.3 presents in detail a new r-chunk type NSA, called PNSA, that combines positive and negative selection. In our proposed approach, binary trees are used as the data structure for storing the detectors set to reduce memory complexity, and thereby the time complexity of the detection phase. Section 2.4 details preliminary experimental results. A summary of the chapter is given in the final section.

2.2 Related works

Both PSA and NSA achieve quite similar performance for detecting novelty in data patterns [24]. Dasgupta et al. [21] conducted one of the earliest experiments on combining positive with negative selection. The combined process is embedded in a genetic algorithm using a fitness function that assigns a weight to each bit based on domain knowledge. Their method aims to reduce neither detector storage complexity nor detection time. Esponda et al. [34] proposed a generic NSA for anomaly detection problems. Their model of normal behavior is constructed from an observed sample of normally occurring patterns. Such a model could represent either the set of allowed patterns (positive detection) or the set of anomalous patterns (negative detection). However, their NSA does not combine positive and negative selection in the detection phase as the algorithm proposed here does. Stibor et al. [80] argued that positive selection might have better detection performance than negative selection. However, the choice between positive and negative selection obviously depends on the representation used by the AIS-based application.

To the best of our knowledge, there has not been any published attempt at combining r-chunk type PSA and NSA for the purpose of reducing detector storage complexity and detection time complexity.

2.3 New Positive-Negative Selection Algorithm

Our algorithm first constructs ℓ − r + 1 binary trees (called positive trees) corresponding to the ℓ − r + 1 positive r-chunk detectors sets Dpi, i = 1, . . . , ℓ − r + 1. Then, all complete subtrees of these trees are removed to achieve a compact representation of the positive r-chunk detectors sets while maintaining the detection coverage. Finally, for every ith positive tree, we decide whether or not it should be converted to the negative tree, which covers the negative r-chunk detectors set Dni. The decision depends on which tree is more compact. When this process is done, we have ℓ − r + 1 compact binary trees, some of which represent positive r-chunk detectors while the others represent negative ones.

The r-chunk matching rule on binary trees is implemented as follows: a given sample s matches the ith positive (negative) tree if s[i . . . i + k] is a path from the root to a leaf, i = 1, . . . , ℓ − r + 1, k < r. The detection phase can be conducted by traversing the compact binary trees iteratively one by one: a sample s is claimed as nonself if it matches a negative tree or if it fails to match some positive tree; otherwise it is considered as self.

Example 2.1. For the set of six self strings from Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}, where ℓ = 5 and r = 3, the six binary trees (the left and right children are labeled 0 and 1, respectively) represent the 3-chunk detectors sets (Dpi and Dni, i = 1, 2, 3) as depicted in Fig. 2.1. In the figure, dashed arrows in some positive trees mark the complete subtrees that will be removed to achieve the compact tree representation. The positive trees for Dp1, Dp2 and Dp3 are in (a), (c) and (e), respectively; the negative trees for Dn1, Dn2 and Dn3 are in (b), (d) and (f), respectively.

The numbers of nodes of the trees in Figures 2.1.a - 2.1.f (after deleting complete subtrees) are 9, 10, 7, 6, 8 and 8, respectively. Therefore, the chosen final trees are those in Figures 2.1.a (9 nodes), 2.1.d (6 nodes) and 2.1.e or 2.1.f (8 nodes). In a real implementation, it is unnecessary to generate both positive trees and negative trees. Since each Dpi could dually be represented either by a positive or a negative tree, we only need to generate (compact) positive trees. If a compact positive tree T has more leaves than internal nodes with a single child, the corresponding negative tree T′ has fewer nodes than T. Therefore, T′ should be used instead of T to represent Dni more compactly. Figure 2.3 presents a diagram of the algorithm. The following example illustrates this observation.

Figure 2.1: Binary tree representation of the detectors set generated from S.

Example 2.2. Consider again the set of six self strings S from Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}. The compact positive tree for the positive 3-chunk detectors set Dp2 = {(000,2); (001,2); (011,2); (100,2); (101,2)} is shown in Fig. 2.2.a. This tree has three leaves and two nodes that have only one child (in dotted circles), so it should be converted to the corresponding negative tree, as illustrated in Fig. 2.2.b.

Figure 2.2: Conversion of a positive tree to a negative one.
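The counting argument behind this conversion can be sketched in Python (an illustrative sketch, not the thesis implementation): trees are stored as nested dicts keyed by '0'/'1', with an empty dict standing for a leaf.

```python
def count_leaves_and_one_child(tree):
    """Return (#leaves, #internal nodes with exactly one child)."""
    if not tree:                      # empty dict = leaf
        return 1, 0
    leaves = singles = (1 if len(tree) == 1 else 0), 0  # placeholder, fixed below
    leaves, singles = 0, (1 if len(tree) == 1 else 0)
    for child in tree.values():
        l, s = count_leaves_and_one_child(child)
        leaves += l
        singles += s
    return leaves, singles

# Compact positive tree for Dp2 = {000, 001, 011, 100, 101} after pruning the
# complete subtrees below '00' and '10' (cf. Fig. 2.2.a):
positive = {'0': {'0': {},            # leaf '00' covers 000 and 001
                  '1': {'1': {}}},    # leaf '011'
            '1': {'0': {}}}           # leaf '10' covers 100 and 101

leaves, singles = count_leaves_and_one_child(positive)
convert_to_negative = leaves > singles   # 3 > 2: the negative tree is smaller
```

Here the reduction is leaves − singles = 1 node, matching the 7-node positive tree of Fig. 2.2.a versus the 6-node negative tree of Fig. 2.2.b.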


Algorithm 2.1 Detector Generation Algorithm.
1: procedure DetectorGeneration(S, r, T)
Input: A set of self strings S ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
Output: A set T of ℓ − r + 1 prefix trees presenting all r-chunk detectors.
2:  T = ∅
3:  for i = 1, ..., ℓ − r + 1 do
4:    create an empty prefix positive tree Ti
5:    for all s ∈ S do
6:      insert every s[i . . . i + r − 1] into Ti
7:    end for
8:    for all internal nodes n ∈ Ti do
9:      if n is the root of a complete binary subtree then delete this subtree end if
10:   end for
11:   if (number of leaves of Ti) > (number of nodes of Ti that have only one child) then
12:     for all internal nodes of Ti do
13:       if the node has only one child then
14:         if the child is a leaf then delete the child end if
15:         create the other child for the node
16:       end if
17:     end for
18:     mark Ti as a negative tree
19:   end if
20:   T = T ∪ {Ti}
21: end for
22: end procedure

The proposed technique is summarized by Algorithm 2.1 and Algorithm 2.2. The first algorithm, Algorithm 2.1, generates ℓ − r + 1 trees, each of which is labeled as positive or negative. The process of generating the compact binary (positive and negative) trees representing the complete r-chunk detectors set is conducted in the outer "for" loop. First, each binary positive tree Ti is constructed by the first inner loop. Then, the compactification of each Ti is conducted by the second one, i = 1, . . . , ℓ − r + 1. The conversion of a positive tree to a negative one takes place in the "if" statement after the second inner "for" loop. The procedure for recognizing a given cell string s as self or nonself is carried out by the "while . . . do" and "if . . . then . . . else" statements of the second algorithm. Figure 2.4 presents a diagram of the algorithm.
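The generation and detection phases can be sketched together in Python under stated assumptions: trees are nested dicts keyed by '0'/'1' (empty dict = leaf), and all trees are kept in positive form, i.e. the conversion of some trees to negative form performed by Algorithm 2.1 is omitted here since it only changes storage, not the detection outcome.

```python
def build_trie(chunks):
    """Prefix tree of a set of equal-length binary strings; leaf = empty dict."""
    root = {}
    for chunk in chunks:
        node = root
        for bit in chunk:
            node = node.setdefault(bit, {})
    return root

def prune_complete(node):
    """Bottom-up removal of complete binary subtrees (dashed arrows in Fig. 2.1)."""
    for child in node.values():
        prune_complete(child)
    if len(node) == 2 and not node['0'] and not node['1']:
        node.clear()          # both branches fully covered: node becomes a leaf

def matches(tree, bits):
    """True iff some prefix of `bits` is a root-to-leaf path of `tree`."""
    node = tree
    for b in bits:
        if not node:          # reached a leaf
            return True
        if b not in node:
            return False
        node = node[b]
    return not node           # consumed all bits and ended exactly at a leaf

def classify(S, s, r):
    ell = len(s)
    trees = []
    for i in range(ell - r + 1):
        t = build_trie({x[i:i + r] for x in S})
        prune_complete(t)
        trees.append(t)
    # s is self iff it matches every (positive) tree
    ok = all(matches(t, s[i:i + r]) for i, t in enumerate(trees))
    return "self" if ok else "nonself"

S = {"00000", "00010", "10110", "10111", "11000", "11010"}
print(classify(S, "10100", 3))   # nonself, as in Example 2.3
print(classify(S, "00000", 3))   # self: a self string matches all trees
```

In the thesis's PNSA, trees whose negative form is smaller are additionally converted and queried with the complementary rule, which reduces storage without changing the result above.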


Figure 2.3: Diagram of the Detector Generation Algorithm.

The detection phase is carried out by the second algorithm, Algorithm 2.2, and is illustrated by the following example.

Example 2.3. Given S and r as in Example 1.3, S = {00000, 00010, 10110, 10111, 11000, 11010}, and s = 10100 as the inputs of the algorithm, three binary trees are constructed as the detectors set, shown in Figures 2.1.a, 2.1.d and 2.1.e. The output of the algorithm is "s is nonself" because no path of tree T2 contains the substring s[2 . . . 4] = 010.

Algorithm 2.2 Positive-Negative Selection Algorithm.
1: procedure PNSA(T, r, s)
Input: A set T of ℓ − r + 1 prefix trees presenting all r-chunk detectors, a matching threshold r ∈ {1, . . . , ℓ}, an unlabeled string s ∈ Σ^ℓ.
Output: A label of s (as self or nonself).
2:  flag = true                  ▷ A temporary boolean variable
3:  i = 1
4:  while (i ≤ ℓ − r + 1) and (flag = true) do
5:    if (Ti is a positive tree) and (s ∉ Ti) then
6:      flag = false
7:    end if
8:    if (Ti is a negative tree) and (s ∈ Ti) then
9:      flag = false
10:   end if
11:   i = i + 1
12: end while
13: if flag = false then
14:   output s is nonself
15: else
16:   output s is self
17: end if
18: end procedure

From the description of DetectorGeneration, it is trivial to show that it takes |S|·(ℓ − r + 1)·r steps to generate all necessary trees (detector generation time complexity) and (ℓ − r + 1)·r steps to verify a cell string as self or nonself in the worst case (worst-case detection time complexity). These time complexities are similar to those of popular NSAs (PSAs) such as the one proposed in [31]. However, by using compact positive and negative binary trees for storing the detectors set, PNSA can reduce the storage complexity of the detectors set in comparison with other r-chunk type single NSAs or PSAs that store detectors as binary strings. This storage complexity reduction can potentially lead to better detection time complexity in real and average cases. To see this, we first state the following theorem:

Figure 2.4: Diagram of the Positive-Negative Selection Algorithm.

Theorem 2.1 (PNSA detector storage complexity). Given a self set S and an integer ℓ, the procedure DetectorGeneration produces a detector (binary) tree set that has in total at most (ℓ − r + 1)·2^(r−2) fewer nodes than the detector tree set created by a PSA or NSA alone, where r ∈ {2, . . . , ℓ − r + 1}.


Proof. We only prove the theorem for the PSA case; the NSA case can be proven in a similar way. Since ℓ − r + 1 positive trees can be built from the self set S, the theorem is proved if we can show that at most 2^(r−2) nodes can be reduced from a positive tree. The theorem is proved by induction on r (also the height of the binary trees).

Note that when converting a positive tree to a negative tree, the reduction in the number of nodes is exactly the number of leaf nodes minus the number of internal nodes that have only one child.

When r = 2, there are 16 possible positive trees of height 2. By examining all 16 cases, we have found that the maximum reduction in the number of nodes is 1. One example of these cases is the positive tree that has 2 leaf nodes after compactification, as in Fig. 2.5.a. Since it has two leaf nodes and one one-child internal node, after being converted to the corresponding negative tree, the number of nodes is reduced by 2 − 1 = 1.

Figure 2.5: One node is reduced in a tree: a compact positive tree has 4 nodes (a) and its conversion (a negative tree) has 3 nodes (b).

Suppose that the theorem's conclusion holds for all r < k. We shall prove that it also holds for k. This follows from the observation that among all positive trees of height k, there is at least one tree whose left subtree and right subtree (of height k − 1) can each be reduced by at least 2^((k−1)−2) nodes after conversion.

A real experiment on a network intrusion dataset, shown at the end of the following section, shows that the actual storage reduction is only about 0.35% of this maximum.


2.4 Experiments

Next, we investigate the possible impact of PNSA's reduction in detector storage complexity on the real (average) detection time, in comparison with a single NSA (PSA). All experiments are performed on a laptop computer with Windows 8 Pro 64-bit, an Intel Core i5-3210M CPU at 2.50GHz (4 CPUs), and 4GB RAM.

Table 2.1 shows the results on detector memory storage and detection time of PNSA compared to a popular NSA proposed in [31] for some combinations of S, ℓ and r. The training dataset of selves S contains randomly generated binary strings. The memory reduction is measured as the ratio of the reduction in the number of nodes of the binary tree detectors generated by PNSA to the number generated by the NSA in [31]. The comparative results show that when ℓ and r are sufficiently large, the detector storage and the detection time of PNSA are significantly smaller than those of the NSA in [31] (36% and 50% less, respectively).

Table 2.1: Comparison of memory and detection time reductions.

|S|    ℓ   r   Memory (%)   Time (%)
1,000  50  12  0            0
2,000  30  15  2.5          5
2,000  40  17  25.9         42.7
2,000  50  20  36.3         50

We have conducted another experiment by choosing ℓ = 40, |S| = 20,000 (S is a set of randomly generated binary strings of length ℓ) and varying r (from 15 to 40). Then, ℓ − r + 1 trees were created using the single NSA and another ℓ − r + 1 compact trees were created using PNSA. Next, both detectors sets were used to detect every s ∈ S. Fig. 2.6 depicts the detection time of PNSA and NSA in this experiment. The results show that the PNSA detection time is significantly smaller than that of NSA. For instance, when r is from 20 to 34, detection in PNSA is about 4.46 times faster than in NSA.

Figure 2.6: Detection time of NSA and PNSA.

The next experiment is conducted on the Netflow dataset, a conversion of the Tcpdump traces of the well-known DARPA dataset to Netflow [86]. We use all 105,033 normal flows as self samples. This self set is first converted to binary strings of length 104; we then run our algorithm with r ranging from 5 to 45. Table 2.2 shows some of the experiment steps. The percentage of node reduction is in the final column. Fig. 2.7 depicts the reduction of nodes in the trees created by PNSA in comparison to those of NSA for all r = 3, . . . , 45. It shows that the reduction is more than one third when the matching threshold is greater than 19.

Table 2.2: Comparison of nodes generation on Netflow dataset.

r   NSA         PNSA        Reduction (%)
5   727         706         2.89
10  33,461      31,609      5.53
15  1,342,517   1,154,427   14.01
20  9,428,132   6,157,766   34.68
25  18,997,102  11,298,739  40.52
30  29,668,240  17,080,784  42.42
35  42,596,987  24,072,401  43.48
40  58,546,497  32,841,794  43.90
45  79,034,135  44,194,012  44.08

An experiment on the Spambase dataset [8], presented in our work [A4], shows that the maximum node reduction percentages of PNSA in comparison to PSA and NSA for some r are 20% and 40%, respectively.

Figure 2.7: Nodes reduction on trees created by PNSA on the Netflow dataset.

Figure 2.8: Comparison of nodes reduction on the Spambase dataset.

2.5 Summary

In this chapter, we have proposed a novel approach to combining PSA and NSA. The new algorithm, PNSA, uses a compact representation of the detectors set as positive and negative binary trees. PNSA was theoretically proven to achieve better detector storage complexity than a single NSA or PSA while maintaining the detection coverage, detector generation time complexity, and worst-case detection time complexity. It can in fact balance the availability of both self samples and nonself ones. This can potentially lead to lower detection time in real cases (i.e., smaller average detection time complexity), as preliminarily confirmed in our experiments.

The main contents of this chapter are published in a Vietnamese journal [A4]. This article extends, by about 30%, its previous version in the ISI-indexed proceedings of the IEEE International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2013.


Chapter 3

GENERATION OF COMPACT DETECTOR SET

In this chapter, we present a new NSA to generate a complete and nonredundant detectors set using an extension of the r-chunk matching rule. This allows us to reduce detector storage and classification time. Experimental results on two datasets show the effectiveness of the proposed algorithm.

3.1 Introduction

A complete and nonredundant detectors set, also called a perfect detectors set, is necessary to achieve the maximum detection coverage with the minimum number of detectors [30]. In general, a larger detector set does not always contribute more coverage because of overlap. With a nonredundant representation, however, the overlap problem does not arise.

To date, there have been some computing models of binary detectors that can generate a complete and non-redundant detectors set, called a perfect detectors set (a set of minimum detectors with the same detection ability as that of all possible detectors), such as those based on prefix trees [31] or automata [32]. In these models, detectors are represented as a whole structure (tree or automaton) rather than as a set of individual strings. While they provide a more compact representation of the detectors sets for AIS, and therefore achieve better detection time complexity, these models of binary-based AIS are hard to deploy in distributed environments. Naturally, one desirable property of an NSA is its ability to be implemented in a distributed manner - each detector might detect different kinds of nonself, which is desirable for many applications such as computer security systems. Therefore, the focus of this chapter is on binary-based NSAs that employ a discrete set of detectors (strings) so that they can be implemented in distributed environments. We can, for example, randomly divide the discrete set of detectors into subsets, each of which is deployed at a site in a distributed environment.

The rest of the chapter is organized as follows. In the next section, we briefly review some related works. In Section 3.3, we detail our new NSA that can generate perfect detectors sets based on the rcbvl matching rule. Section 3.4 briefly describes our experiments in generating perfect detectors sets. Section 3.5 summarizes the chapter and discusses some possible future works.

3.2 Related works

With respect to binary-based AIS using a discrete detectors set, to the best of our knowledge, the only algorithms for generating a perfect and discrete set of r-chunk-based (rcb-based) detectors were proposed by T. Stibor in [78] and by Wierzchoń in [90]; they have frequently been cited, compared, and applied in the literature¹.

The algorithm in [78] uses a hash table data structure to insert, delete and search boolean values efficiently; the values are indexed with a composite key of the r-chunk string concatenated with the detector index i. The algorithm in [90] uses discrete and partial detectors to find their overlapping patterns (strings). It also takes time to find ineffective patterns in a preprocessing step. In spite of generating perfect detectors sets, neither algorithm uses a compact representation of discrete detectors.

In this chapter, we introduce a new deterministic algorithm to generate a perfect and discrete set of rcbvl-based detectors, which is equivalent to a full set of r-chunk-based detectors in terms of anomaly detection. Moreover, the compact string-based detectors set can achieve better memory and time complexities than conventional string-based algorithms. Note that, as discussed in Chapter 1, a perfect detectors set cannot solve the problem of holes.

¹At the time of writing this chapter, the work in [78] (in [90]) has had 44 (47) citations on Google Scholar.

3.3 New negative selection algorithm

Given a non-empty set S of self strings of length ℓ and an integer r ∈ {1, . . . , ℓ}, this section presents a new NSA based on the rcbvl matching rule. Prefix trees are first used to generate a perfect detectors set from S, and this set is then used to classify a new sample as self or nonself.

3.3.1 Detectors set generation under rcbvl matching rule

Algorithm 3.1 summarizes the first phase of the new NSA, called RNSA. In the algorithm, we first construct, for every position i ∈ {1, . . . , ℓ − r + 1}, a prefix tree Ti. These trees are temporary data structures for detector generation. Each prefix tree Ti is constructed as follows: start with an empty prefix tree and insert every s[i . . . i + r − 1], s ∈ S, into it (lines 5-9). Next, for every non-leaf node n and every c ∈ Σ such that no edge with label c starts at n, create a new leaf n′ and an edge (n, n′) labeled with c. Finally, delete every node from which none of the newly created leaves is reachable (lines 10-17). The detectors set D is first created from all s ∈ T1 in line 18. The rest of the algorithm, lines 19-43, extends the partial detectors in D by identifying their right-overlapping strings in the prefix trees.

From the description of the algorithm, it takes |S|·(ℓ − r + 1)·r steps to generate the ℓ − r + 1 prefix trees and |D|·(ℓ − r + 1)·2r steps to generate the perfect detectors set D. Figure 3.1 presents a diagram of the algorithm.

Example 3.1. Given ℓ, r and the set of self strings S as in Example 1.1, S = {010101, 111010, 101101, 100011, 010111}. The steps by which Algorithm 3.1 generates the perfect detectors set {(0001,1,2), (00100,1,4), (100,4,4), (011110,1,4), (11000,1,3)} of Example 1.4 are as follows:

The set D is first created as {(00,1,1), (011,1,1), (110,1,1)}. Then the for loop (lines 20-42) computes D and D1 as follows:

For i = 2: D = {(0001,1,2), (0010,1,2), (0111,1,2), (1100,1,2)} and D1 = ∅.

For i = 3: D = {(00100,1,3), (01111,1,3), (11000,1,3)} and D1 = {(0001,1,2)}.


Algorithm 3.1 Algorithm to generate a perfect rcbvl detectors set
1: procedure GenerationDetectors(S, r, D)
Input: A set of self strings S ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
Output: A set D of rcbvl detectors.
2:  for i = 1, ..., ℓ − r + 1 do
3:    create an empty prefix tree Ti
4:  end for
5:  for all s ∈ S do
6:    for i = 1, ..., ℓ − r + 1 do
7:      insert s[i . . . i + r − 1] into Ti
8:    end for
9:  end for
10: for i = 1, ..., ℓ − r + 1 do
11:   for all non-leaf nodes n ∈ Ti and all c ∈ Σ do
12:     if no edge with label c starts at n then
13:       create a new leaf n′ and an edge (n, n′) labeled with c
14:     end if
15:   end for
16:   delete every node n ∈ Ti from which none of the newly created leaves is reachable
17: end for
18: D = {(s, 1, 1) | s ∈ T1}
19: D1 = ∅
20: for i = 2, ..., ℓ − r + 1 do
21:   D2 = ∅
22:   for all (s, k, j) ∈ D do
23:     if there exists an s′ ∈ Ti such that s[i − k + 1 . . . |s|] is a prefix of it then
24:       D2 = D2 ∪ {(s + s′[|s| − j + k . . . |s′|], k, i)}
25:       delete every node n ∈ Ti from which only nodes of this s′ are reachable
26:       for all s′ ∈ Ti such that s[i − k + 1 . . . |s|] is a prefix of it do
27:         if |s| − i + k < r then
28:           D2 = D2 ∪ {(s[|s|] + s′, i − 1, i)}
29:         else
30:           D2 = D2 ∪ {(s′, i, i)}
31:         end if
32:         delete every node n ∈ Ti from which only nodes of this s′ are reachable
33:       end for
34:     else
35:       D1 = D1 ∪ {(s, k, j)}
36:     end if
37:   end for
38:   for all s′ ∈ Ti do
39:     D2 = D2 ∪ {(s′, i, i)}
40:   end for
41:   D = D2
42: end for
43: D = D ∪ D1
44: end procedure


Figure 3.1: Diagram of the algorithm to generate a perfect rcbvl detectors set.


For i = 4: D = {(00100,1,4), (011110,1,4)} and D1 = {(0001,1,2), (11000,1,3), (100,4,4)}.

The final step, D = D ∪ D1 in line 43, generates the perfect detectors set {(0001,1,2), (00100,1,4), (100,4,4), (011110,1,4), (11000,1,3)}.
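The contents of the complemented trees Ti have a tree-free description that is convenient for checking this example: the strings read off Ti are exactly the minimal binary strings p (|p| ≤ r) such that no self chunk at position i has p as a prefix. A brute-force sketch of this view (assumed semantics for illustration, not the thesis implementation):

```python
def negative_patterns(S, i, r):
    """Minimal prefixes p (|p| <= r) covered by no self chunk at position i (0-based)."""
    chunks = {s[i:i + r] for s in S}
    out = []
    def walk(p):
        if not any(c.startswith(p) for c in chunks):
            out.append(p)                 # minimal uncovered prefix: a detector path
            return
        if len(p) == r:
            return                        # p itself is a self chunk
        walk(p + "0")
        walk(p + "1")
    walk("")
    return out

S = {"010101", "111010", "101101", "100011", "010111"}
print(negative_patterns(S, 0, 3))   # ['00', '011', '110']
```

These are exactly the first components of the initial detectors (00,1,1), (011,1,1), (110,1,1) built from T1 in Example 3.1.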

3.3.2 Detection under rcbvl matching rule

To detect whether a given string s is self or nonself, we simply check the rcbvl matching rule for s against every detector in D. If some detector matches, s is output as nonself; otherwise s is self. The function min used in Algorithm 3.2 returns the smaller of two values. It is easy to see that this algorithm has the same time complexity as Algorithm 3.1.

Algorithm 3.2 Algorithm to detect if a given string s is self or nonself
1: procedure VNSA(S, s, r, D)
Input: A set of self strings S ⊆ Σ^ℓ, an unlabeled string s ∈ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
Output: A set D of rcbvl detectors, a label of s (self or nonself).
2:  GenerationDetectors(S, r, D)
3:  for all (s′, n, m) ∈ D do
4:    for i = n, ..., m do
5:      if s′[i . . . min(i + r − 1, |s′|)] occurs in s at position i then
6:        output s is nonself
7:        exit procedure
8:      end if
9:    end for
10: end for
11: output s is self
12: end procedure
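The inner check of Algorithm 3.2 can be transcribed almost literally into Python (0-based indexing here versus 1-based in the listing; this is one hedged reading of the rcbvl rule, shown only to make the loop structure concrete):

```python
def is_nonself(s, detectors, r):
    """Literal transcription of Algorithm 3.2's loops; detectors are (d, n, m)."""
    for d, n, m in detectors:
        for i in range(n, m + 1):
            seg = d[i - 1:min(i + r - 1, len(d))]   # s'[i .. min(i+r-1, |s'|)]
            if seg and s[i - 1:i - 1 + len(seg)] == seg:
                return True                          # output: s is nonself
    return False                                     # output: s is self

detectors = [("0001", 1, 2)]
print(is_nonself("000111", detectors, 3))   # True: "000" occurs at position 1
print(is_nonself("010101", detectors, 3))   # False: no segment matches
```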

3.4 Experiments

In our experiments, we use the popular flow-based NetFlow dataset [86] and a random dataset. The flow-based NetFlow dataset is used for Experiment 1. Similar to previous studies [86], we select the same 4 features (Packets, Octets, Duration and Flags) from the NetFlow dataset as the input of Experiment 1. A randomly created dataset is used for Experiment 2. This dataset contains 50,000 binary strings of length 100.

Flows in NetFlow are converted into binary strings in two steps. The first step maps every feature to a binary string feature; after this step, binary string features are constructed for both normal and abnormal data. The second step concatenates the binary string features of every flow; after this step, the dataset contains binary strings of length 49. The distributions of the training and testing datasets, as well as the parameters r and ℓ, for the 4 experiments are described in Table 3.1.
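The two conversion steps can be sketched as follows; the per-feature bit widths are illustrative assumptions chosen only so that the concatenation reaches the stated total length of 49 (the thesis does not state the exact per-feature widths here):

```python
def to_bits(value, width):
    """Clip a non-negative feature value and encode it as a fixed-width bit string."""
    return format(min(value, 2 ** width - 1), "0{}b".format(width))

# Step 1: map each of the 4 selected features to a binary string feature;
# step 2: concatenate them per flow. Widths (16, 16, 9, 8) sum to 49 and are
# hypothetical.
def flow_to_string(packets, octets, duration, flags, widths=(16, 16, 9, 8)):
    fields = (packets, octets, duration, flags)
    return "".join(to_bits(v, w) for v, w in zip(fields, widths))

s = flow_to_string(packets=3, octets=168, duration=4, flags=2)
print(len(s))   # 49
```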

Table 3.1: Data and parameters distribution for experiments and results comparison.

        ℓ   r   Train    Test    Size r-chunk (bits)  Size rcbvl (bits)  Time r-chunk (min)  Time rcbvl (min)
Case 1  49  10  119,571  10,000  206,810              42,704             1                   0.2
        49  8   79,571   50,000  31,672               8,096              0.8                 0.18
Case 2  30  12  25,000   25,000  367,092              79,222             4.1                 0.75
        30  14  40,000   10,000  2,324,056            392,815            8.65                1.4

The results in Table 3.1 show that our proposed algorithm reduces both the size (in bits) of the detectors and the time (in minutes) to classify the testing dataset in comparison with the best algorithm in [78]. All experiments are performed on a laptop computer with Windows 8 Pro 64-bit, an Intel Core i5-3210M CPU at 2.50GHz (4 CPUs), and 4GB RAM.

3.5 Summary

In this chapter, we have introduced a novel NSA to generate perfect detectors sets for string-based AIS. We developed the rcbvl matching rule as an extension of the traditional rcb rule. Our new algorithm has much lower time complexity than the previously proposed algorithm for generating perfect detectors sets in [78]. More importantly, the proposed algorithm always generates complete and non-redundant detectors sets for string-based AIS. Experimental results show that the proposed algorithm can reduce both the runtime of detection and the storage of detectors.


Varying the length parameter r in the rcbvl matching rule can balance specialisation and generalisation in classification systems, which will be the next step of our study. Moreover, improving our approach to generate perfect detectors sets under the rcb matching rule is an interesting research direction.

According to the discussion on holes in Section 1.5.5, the number of holes can be affected by the matching rules. We anticipate that the rcbvl matching rule, with appropriate lengths of the corresponding detectors, may create an acceptable number of holes. This will be another of our next research directions.

A small part of the contribution of this chapter is published in an international journal [A6].


Chapter 4

FAST SELECTION ALGORITHMS

4.1 Introduction

Although NSAs have been studied the most in the AIS literature and have been successfully applied not only to computer science problems but also to biological research questions, their runtime has limited their practical usefulness for problems with large-scale data. We therefore investigate methods to compute the classification outcome efficiently. Specifically, we reduce the computational complexity of this task using techniques from string processing and data compression. We obtain linear-time algorithms for two important matching rules, the r-contiguous rule and the r-chunk rule, for which negative selection had recently been conjectured to be polynomial [32, 82, 84].

In this chapter, we present fast NSAs for the r-chunk rule and the r-contiguous rule, denoted Chunk-NSA and Cont-NSA respectively, which generate compressions of complete and non-redundant detectors sets in linear time with respect to the size of the training data. The algorithms comprise two main phases: generating a graph as a representation of all detectors, and adjusting it into an automaton. We will prove that the proposed algorithms are r times faster in data training than that in [32]. The prefix DAG is an important data structure used in the algorithms.

In the following section, we review the most closely related works. The two sections after that present the r-chunk detector-based NSA and the r-contiguous detector-based NSA in detail, respectively. Section 4.5 details preliminary experimental results. The summary of the chapter is in the final section.


4.2 Related works

Table 4.1 compares our results with the runtimes of previously published algorithms. All runtimes for the training phase and classification phase are given for a binary alphabet (|Σ| = 2), since not all algorithms are applicable to arbitrary alphabets. The parameter K = min{|S[i..i + r − 1]|, i = 1, ..., ℓ − r + 1}. The parameter |D|, the desired number of detectors, is only applicable to algorithms that generate detectors explicitly.

The early work of D'haeseleer et al. [29] proposed algorithms that generate the detector set by an exhaustive search. However, these algorithms still have a runtime exponential in r. Other algorithms with the same complexities were later proposed by Wierzchoń [89], Ayara et al. [6], and Stibor et al. [78]. Our algorithms and the algorithm in [32] produce the results that would be obtained with the maximal number of generated detectors. We propose NSAs that achieve a substantial performance improvement in the training phase compared with the most efficient published NSA in [32].

Recently, Textor [82] proposed a sampling NSA. However, the algorithm is inherently inefficient, as it typically needs exponentially many patterns to achieve sufficient coverage of the universe. This NSA was later generalized by finite automata-based models to perform several related algorithmic tasks, and these models are not complicated to implement and customize to real-world problems [84]. Nevertheless, the runtimes of this NSA are still polynomial.

4.3 A fast negative selection algorithm based on r-chunk detector

Each self string of S is supposed to have an associated index, S = {si, i = 1, .., |S|}. We also use the following notations:

• An array Q, where Q[s][c] is a pointer used for creating new nodes in the tree, s ∈ Σ^{r−1} and c ∈ Σ. This data structure is used for expanding the partial DAG gradually.

• An array P, where P[i] is a structure of two fields, a pointer P[i].end and a string


Table 4.1: Comparison of our results with the runtimes of previously published algorithms.

Matching rule  Algorithm                          Training                   Classification
r-chunk        Stibor et al. [78]                 (2^r + |S|)(ℓ − r + 1)     |D|ℓ
r-chunk        Elberfeld, Textor [31]             r^2 |S|(ℓ − r)             |S|ℓ^2 r
r-chunk        Elberfeld, Textor [32]             |S|ℓr                      ℓ
r-chunk        Present thesis                     |S|ℓ                       ℓ
r-contiguous   D'haeseleer et al. [29] (linear)   (2^r + |S|)(ℓ − r)         |D|ℓ
r-contiguous   D'haeseleer et al. [29] (greedy)   2^r |S|(ℓ − r)             |D|ℓ
r-contiguous   Wierzchoń [89]                     2^r (|D|(ℓ − r) + |S|)     |D|ℓ
r-contiguous   Elberfeld, Textor [31]             |S|^3 ℓ^3 r^3              |S|^2 ℓ^3 r^3
r-contiguous   Elberfeld, Textor [32]             |S|ℓr                      ℓ
r-contiguous   Present thesis                     |S|ℓ + (2^r − K)·ℓ         ℓr

P[i].str ∈ Σ^{r−1}, i = 1, .., |S|. Each P[i] is joined with si, where the first field P[i].end is used to point to the end node of the path of si in the prefix DAG, and the second field P[i].str is a suffix of the path ended by P[i].end.

The two arrays Q and P are used to expand the prefix DAG step by step. The final automaton deciding membership of strings in Σ^ℓ is constructed in two phases. The first phase creates a DAG G such that L(G) ∩ Σ^ℓ = Σ^ℓ \ L(CHUNK(S, r)) by Algorithm 4.1; this DAG is then turned into an automaton M such that L(M) ∩ Σ^ℓ = L(CHUNK(S, r)) in the second phase by Algorithm 4.2. In Algorithm 4.1, the notations NULL and new() are used with semantics similar to those in the C programming language.

The key idea behind our construction is that instead of generating all ℓ − r + 1 prefix trees T1, ..., T_{ℓ−r+1} as in [32], we only create one tree T1 for S[1..r] explicitly and then enlarge it by adding nodes and edges level-by-level to obtain a prefix DAG. After this process, the DAG encodes all positive detectors S[i . . . i + r − 1], i = 1, ..., ℓ − r + 1. This is then inverted to construct a compression of CHUNK(S, r).

Figure 4.1 presents a diagram of the algorithm.
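As a rough illustration of this level-by-level idea, the suffix-merged DAG and its membership test can be sketched as follows (a simplified sketch only, not the pointer-based Algorithm 4.1: node sharing is modeled by keying nodes on whole chunks, and edges are enumerated quadratically for clarity):

```python
def positive_chunk_dag(S, r):
    """Simplified prefix DAG of all positive detectors: one node per
    (position, chunk) pair, and an edge whenever two chunks at consecutive
    positions overlap in r - 1 symbols.  Sharing nodes across self strings
    is what the arrays P and Q achieve incrementally in Algorithm 4.1."""
    ell = len(next(iter(S)))
    nodes = {i: {s[i:i + r] for s in S} for i in range(ell - r + 1)}
    edges = {(i, u, v)
             for i in range(ell - r)
             for u in nodes[i]
             for v in nodes[i + 1]
             if u[1:] == v[:-1]}
    return nodes, edges

def accepted(x, nodes, r):
    """x lies in L(G) ∩ Σ^ℓ iff every window of x is a self chunk at its
    position (consecutive windows overlap by construction)."""
    return all(x[i:i + r] in nodes[i] for i in range(len(nodes)))
```

With the self set of Example 4.1, the self string 01111 is accepted while the detector 01011 is rejected.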

Example 4.1. Let ℓ = 5, r = 3. Given a set of 7 self strings as in Example 1.2, S = {01111, 00111, 10000, 10001, 10010, 10110, 11111}. The set CHUNK(S, 3) is {(000, 1), (010, 1), (110, 1), (010, 2), (100, 2), (101, 2), (110, 2), (011, 3), (100, 3), (101, 3)}. The set CONT(S, 3) is {01011, 11011}.
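Both sets in this example can be reproduced by brute-force enumeration (a verification sketch only; the algorithms of this chapter exist precisely to avoid this exponential search):

```python
from itertools import product

def chunk_detectors(S, r):
    """r-chunk detectors as a map position -> set of chunks: length-r
    strings occurring in no self string at that (1-based) position."""
    ell = len(next(iter(S)))
    all_chunks = {''.join(p) for p in product('01', repeat=r)}
    return {i + 1: all_chunks - {s[i:i + r] for s in S}
            for i in range(ell - r + 1)}

def cont_detectors(S, r):
    """r-contiguous detectors: full-length strings every window of which
    is an r-chunk detector at its position."""
    ell = len(next(iter(S)))
    det = chunk_detectors(S, r)
    out = set()
    for bits in product('01', repeat=ell):
        x = ''.join(bits)
        if all(x[i:i + r] in det[i + 1] for i in range(ell - r + 1)):
            out.add(x)
    return out
```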


Algorithm 4.1 To generate positive r-chunk detectors set

 1: procedure PositiveR-chunkDetector(S, r, G)
    Input: A set of self strings S ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
    Output: A prefix DAG G presenting the positive r-chunk detectors set.
 2:   G = ∅
 3:   for i = 1, ..., |S| do
 4:     insert si[1 . . . r] into G and assign P[i].end to the leaf node in the path si[1 . . . r]
 5:     P[i].str = si[2 . . . r]
 6:   end for
 7:   for i = r + 1, ..., ℓ do
 8:     for j = 1, ..., |S| do
 9:       if Q[P[j].str][sj[i]] = NULL then
10:         Q[P[j].str][sj[i]] = new()
11:       end if
12:     end for
13:     for j = 1, ..., |S| do
14:       p = P[j].end
15:       for c ∈ Σ do
16:         if (Q[P[j].str][c] ≠ NULL) and (no edge labeled c starts from p) then
17:           create an edge labeled c from p to Q[P[j].str][c]
18:         end if
19:       end for
20:       P[j].end = end node of the edge labeled sj[i] starting from p
21:     end for
22:     for j = 1, ..., |S| do
23:       Q[P[j].str][sj[i]] = NULL
24:       P[j].str = P[j].str[2 . . . r − 1] + sj[i]
25:     end for
26:   end for
27: end procedure

Example 4.2. Let ℓ = 5, r = 3, and let S be the self set from Example 4.1. The prefix DAG generated by Algorithm 4.1 is illustrated in Fig. 4.2.a. After adjusting by Algorithm 4.2, this DAG is turned into an automaton as in Fig. 4.2.b. They have the properties that L(G) ∩ {0,1}^5 = {0,1}^5 \ L(CHUNK(S, 3)) and L(M) ∩ {0,1}^5 = L(CHUNK(S, 3)).

Algorithm 4.3 summarizes the overall proposed Chunk-NSA. In the algorithm, we use a self set S, an integer r ∈ {1, . . . , ℓ}, a cell string s to be detected, and a prefix DAG G. Chunk-NSA detects s as self or nonself.

Theorem 4.1. Given any S ⊆ Σ^ℓ, s ∈ Σ^ℓ and r ∈ {1, . . . , ℓ}, algorithm Chunk-NSA


Figure 4.1: Diagram of the algorithm to generate positive r-chunk detectors set.

constructs an automaton M with L(M) ∩ Σ^ℓ = L(CHUNK(S, r)) in time O(|S|·ℓ·|Σ|) and checks if s ∈ L(M) in time O(ℓ).

Proof. We first construct a prefix DAG G (also a prefix tree in this case) as follows: start with an empty prefix DAG, then insert every prefix s[1 . . . r], s ∈ S, into it. This is done by the first for loop in Algorithm 4.1 in time |S|·r·|Σ|.

Then, for each s[i], s ∈ S, and i = r + 1, .., ℓ, we add nodes and corresponding


Algorithm 4.2 To generate negative r-chunk detectors set

 1: procedure NegativeR-chunkDetector(S, r, G)
    Input: A set of self strings S ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
    Output: A prefix DAG G presenting the negative r-chunk detectors set.
 2:   PositiveR-chunkDetector(S, r, G)
 3:   create a special node n′
 4:   for each non-leaf node n ∈ G do
 5:     for c ∈ Σ do
 6:       if no edge with label c starts at n then
 7:         create a new edge (n, n′) labeled with c
 8:       end if
 9:     end for
10:   end for
11:   for each node n ∈ G do
12:     if n′ is not reachable from n then
13:       delete n
14:     end if
15:   end for
16: end procedure

edges to G using the linking pointers of P and Q, as in lines 7-26 of Algorithm 4.1. The first inner for loop (line 8) creates new pointers to expand G to the next level. Note that P[i] is joined with si, as mentioned at the beginning of this section. The second inner for loop (line 13) adds new nodes and corresponding edges to G. The for loop in line 22 frees all pointers in Q and updates all strings in P. It takes (ℓ − r) iterative steps to generate the DAG G presenting all positive r-chunk detectors. There are three inner for loops (lines 8, 13, 22), each with |S| steps. Therefore, the time complexity of Algorithm 4.1 is |S| + 3·|S|·(ℓ − r + 1)·|Σ|.¹

Algorithm 4.2 inverts the DAG G as follows. Firstly, a special node n′ is created. Then, for every non-leaf node n and every c ∈ Σ for which no edge with label c starts at n, we create a new edge (n, n′) labeled with c. Finally, we delete every node from which n′ is not reachable. There are no more than |S|·ℓ·|Σ| nodes in the DAG G, so the time complexity of Algorithm 4.2 is |S|·ℓ·|Σ|.
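In miniature, the inversion step can be sketched as follows (a hypothetical dict-of-dicts representation; the reachability pruning of Algorithm 4.2 is omitted for brevity):

```python
SINK = "n'"   # the special accepting node n'

def invert(dag, alphabet):
    """For every non-leaf node (a key of `dag`), add an edge to the sink
    for each label with no outgoing edge, so any string that falls off
    the positive DAG is routed to the accepting node n'."""
    inverted = {n: dict(edges) for n, edges in dag.items()}
    for edges in inverted.values():
        for c in alphabet:
            edges.setdefault(c, SINK)
    return inverted
```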

The DAG G generated by Algorithm 4.2 is turned into a finite automaton M by making the leaf node n′ an accepting state with self-loops for all c ∈ Σ and setting

¹Assuming that we can seek each element of the arrays P and Q in unit time. This can be done, for example, by operators in a base-|Σ| numeral system.


Figure 4.2: A prefix DAG G and an automaton M

Algorithm 4.3 A fast r-chunk detector-based NSA

 1: procedure Chunk-NSA(s, S, r, G, M)
    Input: A set of self strings S ⊆ Σ^ℓ, an unlabeled string s ∈ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
    Output: A prefix DAG G presenting the negative r-chunk detectors set, an automaton M, a label of s (self or nonself).
 2:   NegativeR-chunkDetector(S, r, G)
 3:   turn G into an automaton M
 4:   if s ∈ L(M) then
 5:     output s is nonself
 6:   else
 7:     output s is self
 8:   end if
 9: end procedure

the initial state to the root of G (line 3 in Algorithm 4.3). Now we obtain an automaton M that has the properties claimed by the theorem. Moreover, it is easy to see that checking whether s ∈ L(M) takes O(ℓ) time. So we obtain the claimed runtimes.
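The O(ℓ) classification can be mimicked without building the automaton explicitly: with the self chunks of each position hashed into sets, one left-to-right pass decides membership, which is exactly the work M does state by state (a sketch under the same unit-cost lookup assumption):

```python
def is_nonself(s, self_chunks, r):
    """s is matched by some negative r-chunk detector iff at least one of
    its ℓ - r + 1 windows is not a self chunk at that position; each set
    lookup is O(1), so the whole pass is O(ℓ)."""
    return any(s[i:i + r] not in self_chunks[i]
               for i in range(len(s) - r + 1))
```

On Example 4.1, the detector 01011 is flagged as nonself while the self string 01111 is not.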

4.4 A fast negative selection algorithm based on r-contiguous detector

In this section, the two arrays P and Q are used with the same meaning as in the previous section. In addition, we use two sets P1 and P2 of pointers for chunk detectors; they swap their roles at the end of each step.

Theorem 4.2. There exists an algorithm that, given any S ⊆ Σ^ℓ and r ∈ {1, . . . , ℓ}, constructs a finite automaton M with L(M) ∩ Σ^ℓ = CONT(S, r) in time O(|S|·ℓ·|Σ| + (|Σ|^r − K)·ℓ), where K = min{|S[i..i + r − 1]|, i = 1, . . . , ℓ − r + 1}.


Algorithm 4.4 Algorithm to generate negative r-contiguous detectors set

 1: procedure Cont-NSA(S, r, G)
    Input: A set of self strings S ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
    Output: An automaton G presenting the negative r-contiguous detectors set.
 2:   G = P1 = ∅
 3:   for s ∈ Σ^r \ S[1 . . . r] do
 4:     insert s into G and create p.end that points to the leaf node in the path s
 5:     p.str = s[2 . . . r]
 6:     P1 = P1 ∪ {p}
 7:   end for
 8:   for i = r + 1, ..., ℓ do
 9:     for j = 1, ..., |S| do
10:       if Q[P[j].str][sj[i]] = NULL then
11:         Q[P[j].str][sj[i]] = new()
12:       end if
13:     end for
14:     P2 = ∅
15:     for each p ∈ P1 do
16:       for c ∈ Σ do
17:         if Q[p.str][c] = NULL then
18:           Q[p.str][c] = new()
19:           p.str = p.str[2 . . . r − 1] + c
20:           p.end = Q[p.str][c]
21:           P2 = P2 ∪ {p}
22:         else
23:           if Q[p.str][c] is newly created in this inner for loop then
24:             p.str = p.str[2 . . . r − 1] + c
25:             p.end = Q[p.str][c]
26:             P2 = P2 ∪ {p}
27:           end if
28:         end if
29:       end for
30:     end for
31:     for j = 1, ..., |S| do
32:       Q[P[j].str][sj[i]] = NULL
33:       P[j].str = P[j].str[2 . . . r − 1] + sj[i]
34:     end for
35:     for each p ∈ P1 do
36:       for c ∈ Σ do
37:         Q[p.str][c] = NULL
38:       end for
39:     end for
40:     P1 = P2
41:   end for
42:   for each node n ∈ G do
43:     if n cannot reach a leaf node ∈ P1 then
44:       delete n
45:     end if
46:   end for
47:   turn G into an automaton
48: end procedure

Proof. We first construct a prefix DAG G (also a prefix tree in this case) as follows: start with an empty prefix DAG, then insert every s ∈ Σ^r \ S[1 . . . r] into it. We also create P1, containing all pointers that link to the leaves of G. This is done by the first for loop in Algorithm 4.4 in time (|Σ|^r − |S[1..r]|)·r. After this step, we have L(G) ∩ Σ^r = CHUNK(S[1..r], r).

Then, for each s[i], s ∈ S, and i = r + 1, .., ℓ, we add nodes and corresponding edges to G using the linking pointers of P and Q, as in lines 8-39 of Algorithm 4.4. The first inner for loop (line 9) hashes every s ∈ S into Q. The second inner for loop (line 15) expands G to the next level. The main idea of this step is to lengthen each path in G if the corresponding pointer p is not hashed by Q. In other words, each path linked by p in G is concatenated with its right-overlapping r-chunk detector if this detector exists, i.e., the one corresponding to a NULL pointer in Q.

The next two loops (lines 31 and 35) free all pointers in Q and update all strings in P. The last for loop (line 42) deletes all partial paths in G that do not reach a leaf node in P1. In other words, after this step, every leaf in G has level ℓ.

It takes (ℓ − r) main iterative steps (line 8) to generate the DAG G presenting CONT(S, r). Two inner for loops (lines 9, 31) take |S| steps each; the two other inner for loops (lines 15, 35) take |P1| steps each. We have |P1| ≤ |Σ|^r − |S[i..i + r − 1]|, i = 1, . . . , ℓ − r + 1. The worst case, |P1| = |Σ|^r − |S[i..i + r − 1]|, occurs if there exists an index i such that all r-chunk detectors of Dni take part in expanding the DAG G.

Therefore, the time complexity of Algorithm 4.4 is 2·|S|·ℓ·|Σ| + 2·(|Σ|^r − K)·ℓ, where K = min{|S[i..i + r − 1]|, i = 1, . . . , ℓ − r + 1}.

Finally, we turn the DAG G into a finite automaton with the claimed property by making all leaves accepting states with self-loops for all c ∈ Σ and setting the initial state to the root of G. An example of this construction for the dataset in Example 4.1 is


Figure 4.3: Diagram of the algorithm to generate negative r-contiguous detectors set.

shown in Fig. 4.4.


Figure 4.4: An automaton representing the 3-contiguous detectors set.

Corollary 4.1. Given any S ⊆ Σ^ℓ, s ∈ Σ^ℓ and r ∈ {1, . . . , ℓ}. After O(|S|·ℓ·|Σ| + (|Σ|^r − K)·ℓ) preprocessing time, where K = min{|S[i..i + r − 1]|, i = 1, . . . , ℓ − r + 1}, we can check if s ∈ L(CONT(S, r)) in time O(ℓ·r).

Proof. Firstly, a DAG G representing all r-contiguous detectors is created by Algorithm 4.4 in time O(|S|·ℓ·|Σ| + (|Σ|^r − K)·ℓ). Secondly, to check if s ∈ L(CONT(S, r)), we match each substring s[i..i + r − 1], i = 1, . . . , ℓ − r + 1, against the corresponding detectors in G level-by-level.

Figure 4.3 presents a diagram of the algorithm.
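For a small explicit detector set, the window checks behind this corollary look as follows (a brute-force sketch over listed detectors; the DAG performs the same checks level-by-level without ever listing them):

```python
def detected_by_cont(s, detectors, r):
    """r-contiguous matching: s is flagged iff it shares a length-r
    window, at the same position, with at least one detector."""
    return any(s[i:i + r] == d[i:i + r]
               for d in detectors
               for i in range(len(s) - r + 1))
```

With CONT(S, 3) from Example 4.1, the string 01011 is detected while the self string 01111 is not.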

4.5 Experiments

In our experiments, we use the popular flow-based NetFlow dataset [86] and a random dataset. The flow-based NetFlow dataset, generated from the packet-based DARPA dataset [1], is used for Experiment 1; it was encoded as a set of 105,033 binary strings of length 49. A randomly created dataset containing 50,000 strings of length 100 is used for Experiment 2. The distribution of parameter values and the running times (in milliseconds) of the experiments are shown in Table 4.2 and Table 4.3. All experiments are performed on a laptop computer with Windows 8 Pro 64-bit, Intel Core i5-3210M CPU 2.50GHz (4 CPUs), and 4GB RAM.

In Table 4.2, the runtimes of the r-chunk detector-based NSA in [32] on Experiments 1 and 2 are in columns a and c, respectively, and the runtimes of the proposed Chunk-NSA on Experiments 1 and 2 are in columns b and d, respectively. As can be seen from the table, there is a positive correlation between the threshold r and the ratio of the runtime of the NSA in [32] to the runtime of Chunk-NSA (columns a/b and c/d).


Table 4.2: Comparison of Chunk-NSA with r-chunk detector-based NSA.

        Experiment 1          Experiment 2
r       a      b    a/b       c      d    c/d
10    1,330  454    2.9     1,490  482    3.1
11    1,395  439    3.2     1,633  472    3.5
12    1,564  454    3.4       637  360    4.5
13    1,767  435    4.1     2,134  453    4.7
14    1,771  418    4.2     2,276  451    5.0
15    2,092  486    4.3     2,793  450    6.2
16    1,985  437    4.5     3,086  365    8.5
17    2,071  391    5.3     4,079  427    9.6
18    2,249  410    5.5     4,509  422   10.7
19    2,345  375    6.3     5,312  470   11.3
20    2,859  359    7.0     6,796  437   15.6

Fig. 4.5 shows that the ratios approach r as r increases, which fits the theoretical analysis in the previous section relatively well. In the figure, ratio 1 and ratio 2 are the ratios of the runtime of the r-chunk detector-based NSA to the runtime of Chunk-NSA in Experiments 1 and 2, respectively. As presented in the bottom row of Table 4.2, when r = 20 the training of Chunk-NSA on Experiments 1 and 2 is 7.0 and 15.6 times faster, respectively, than that of the 20-chunk detector-based NSA in [32].

Figure 4.5: Comparison of ratios of runtime of r-chunk detector-based NSA to runtime of Chunk-NSA

Similarly, in Table 4.3, the runtimes of the r-contiguous detector-based NSA in [32] on Experiments 3 and 4 are in columns a and c, respectively. Also, the runtimes of the proposed Cont-NSA on Experiments 3 and 4 are in


Table 4.3: Comparison of proposed Cont-NSA with r-contiguous detector-based NSA.

        Experiment 3            Experiment 4
r       a       b     a/b       c       d     c/d
10     1453   1125    1.3      1500    953    1.6
11     1437   1032    1.4      1704   1109    1.5
12     3082   1203    2.6      1906   1063    1.8
13     2344   1047    2.2      2157   1015    2.1
14     2688   1156    2.3      3188   1031    3.1
15     4578   1188    3.9      4203   1235    3.4
16     7181   1254    5.7      6484   1125    5.8
17     9985   1313    7.6      8594   1156    7.4
18    12157   1235    9.8     12923   1047   12.3
19    19652   1500   13.1     18547   1625   11.4
20    24156   1797   13.4     24282   1922   12.6

Figure 4.6: Comparison of ratios of runtime of r-contiguous detector-based NSA to runtime of Cont-NSA

columns b and d, respectively. Consistent with the results described above, there is a positive correlation between the threshold r and the ratio of the runtime of the NSA in [32] to the runtime of Cont-NSA (columns a/b and c/d).

Fig. 4.6 shows that the ratios approach r as r increases, which fits the theoretical analysis in the previous section relatively well. In the figure, ratio 3 and ratio 4 are the ratios of the runtime of the NSA to the runtime of Cont-NSA in Experiments 3 and 4, respectively. The results in the bottom row of Table 4.3 show that when r = 20, the training of Cont-NSA on Experiments 3 and 4 is 13.4 and 12.6 times faster than


that of the 20-contiguous detector-based NSA in [32], respectively.

4.6 Summary

In this chapter, we have introduced a novel fast NSA to generate the r-chunk detectors set in linear time. Empirical results and a theoretical proof show that the proposed algorithm Chunk-NSA outperforms the benchmark one in [32].

A limitation of Chunk-NSA is that it temporarily uses two arrays Q and P to expand the prefix DAG representing all positive detectors. The sizes of Q and P are |Σ|^r and |S|·r, respectively. However, this drawback may not be a serious problem, because modern systems support huge internal memory; it is especially feasible in the binary case, Σ = {0, 1}.

The main limitation of Cont-NSA is that it takes ℓ·r time in the detection phase. However, the generation of a DAG presenting all r-contiguous detectors takes only linear time with respect to the cardinality of S (Theorem 4.2). This therefore provides additional benefits for operations such as counting the number of detectors, or listing the detectors explicitly for distributed detection.
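Counting the compressed detectors, for instance, needs no explicit listing: a dynamic program over the levels of the detector DAG sums path counts (a sketch; det[i] is the hypothetical set of r-chunk detectors at 1-based position i, and paths whose consecutive chunks overlap in r − 1 symbols are exactly the r-contiguous detectors):

```python
def count_cont_detectors(det):
    """Dynamic programming over levels: counts[v] is the number of
    partial detector paths ending in chunk v at the current level."""
    counts = {d: 1 for d in det[1]}
    for i in range(2, max(det) + 1):
        counts = {v: sum(c for u, c in counts.items() if u[1:] == v[:-1])
                  for v in det[i]}
    return sum(counts.values())
```

On the chunk detector sets of Example 4.1, this returns 2, matching |CONT(S, 3)|.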

To verify the effectiveness of the proposed approaches, two datasets are adopted to validate them. The results indicate that the proposed approaches can greatly improve training time. We anticipate that Cont-NSA can be further developed by using more sophisticated graphs for better detection time; this will be discussed specifically in our future research. The main contribution of this chapter has been submitted to an ISI-indexed journal.


Chapter 5

APPLYING HYBRID ARTIFICIAL IMMUNE SYSTEM FOR NETWORK SECURITY

In this chapter, we present our algorithm, called PSA2, a hybrid algorithm that combines the two-class PSA with a statistical technique to achieve better intrusion detection performance.

5.1 Introduction

This chapter presents a hybrid algorithm that combines the two-class PSA and a statistical technique to achieve better intrusion detection performance. This improved PSA is trained with both self and nonself samples, while a typical PSA is trained with only one type of sample. Two types of detectors are generated by PSA2 to cover areas of both the self space and the nonself space. The first type is generated by the original PSA. Considering nonself as self, we apply the PSA to the new self sample set to obtain the second detector type. Both types of detectors are used to detect normal and abnormal data instances in our NIDS.

To the best of our knowledge, there has been no published attempt at modeling the PSA or its variants for flow-based NIDS. Our algorithm PSA2 would be the first example of a flow-based AIS-NIDS. Recently, Fernandes et al. commented in their survey that AIS-NIDS should be combined with network traffic statistics to mitigate the obstacle posed by the sheer amount of network traffic and to diminish the computational burden [35]. Therefore, we believe that our proposed algorithm in this


chapter is the first attempt at combining selection algorithms with a statistical technique.

The rest of the chapter is organized as follows. In the next section, we review the works most closely related to ours. Section 5.3 presents PSA2 in detail. Experiments on the two datasets are introduced in Section 5.4. The final section concludes the chapter and discusses some possible future work.

5.2 Related works

Recently, artificial intelligence-based intrusion detection systems have attracted considerable interest from researchers [43]. Various artificial intelligence algorithms are deployed to build NIDSs (flow-based, network-based, or hybrid ones) or their components. Support vector machines (SVMs), fuzzy systems, and rule-based systems are becoming popular in the context of producing software components for IDSs [96, 61, 91]. We have also seen recent development in this area with natural computation, such as AIS [58, 74, 97, 47], and evolutionary algorithms, like particle swarm optimization (PSO) [28], genetic algorithms (GA) [14], and ant colony optimization (ACO) [63, 95].

Statistical techniques have attracted a lot of attention from researchers in the network security field over the last three decades [88]. Simple statistical techniques can be used to describe network traffic properties that vary significantly between normal and anomalous behavior [16]. However, our review shows that the approach in this chapter is the first attempt at combining selection algorithms with statistics for the purpose of intrusion detection.

As mentioned in Section 1.6, one major characteristic of the network flow is that a flow does not provide any payload. This approach processes only packet headers and therefore causes a significant drop in the volume of traffic to be processed. For this reason, flow-based NIDS currently attracts much attention from researchers, especially for high-speed networks [99]. For example, a system that operates on lightweight network flows and uses one-class SVMs for analysis is proposed in [91]. Different from traditional anomaly detection systems, this system is trained with malicious rather than with benign network data. The system is suited to the load of large-scale networks and is less affected by typical problems of ordinary anomaly detection systems. A method that


focuses on the traffic flow statistics of attackers who send flows to both the object port and generally closed ports in the global network is proposed in [69]. This flow-based attacker detection method accurately identifies hosts sending flows to the object port as attackers, without any deep packet inspection. An improved block-based neural network (BBNN) is introduced in [86]; the BBNN is evolved by a modified genetic algorithm. An experiment on the packet-based DARPA dataset shows that the

enhanced BBNN outperforms some other algorithms. A centroid-based partitioning strategy for data preprocessing is proposed in [43]. The method can select a high-quality subset, with far fewer instances than the original training data, to build lightweight NIDSs. Experimental results on the flow-based datasets introduced in [76, 91] show that it can achieve admirable efficiency in training time and comparable detection performance. A multilayer perceptron (MLP) neural network and the Gravitational Search Algorithm (GSA) are used in a NIDS to classify benign and malicious flows with a very high accuracy rate [70]. A hybrid of two metaheuristic algorithms, PSO and GSA, called the PSOGSA algorithm, is examined to optimize the interconnection weights of an MLP neural network [48]. This metaheuristic algorithm helps an MLP-based anomaly detector improve its accuracy and false alarm rate. M. J. Chapple et al. [17] and M. Zolotukhin et al. [99] treat network traffic as a time series, a powerful tool for describing network traffic evolution patterns, to build effective NIDSs. An extended

version of inductive supervised SVMs, called S4VM, was recently proposed in [49]. This semi-supervised method is used to develop flow-based NIDSs that can be trained with both labelled and unlabelled data. It is very effective at detecting anomalies when only a small amount of labelled data is available.

The above works achieved remarkable results; however, their FARs are relatively high. Our proposed approach in this chapter, which combines the NSA, the PSA and simple statistical techniques, outperforms all of the mentioned works in terms of DR, ACC and FAR.


5.3 Hybrid positive selection algorithm with chunk detectors

Given ℓ, r, a normal dataset N ⊂ Σ^ℓ, and an abnormal dataset A ⊂ Σ^ℓ, Algorithm 5.1 and Algorithm 5.2 create the positive trees and the negative trees, respectively. A new data instance s ∈ Σ^ℓ is detected as self or nonself by Algorithm 5.3.

Algorithm 5.1 Algorithm to generate positive trees

 1: procedure SelfTreesGeneration(N, r, TN)
    Input: A set of self strings N ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
    Output: A set TN of ℓ prefix trees presenting the positive trees.
 2:   for i = 1, ..., ℓ do
 3:     Create an empty tree TNi
 4:   end for
 5:   for all s ∈ Nr do
 6:     for i = 1, ..., ℓ do
 7:       insert every s[i . . . i + r − 1] into TNi
 8:     end for
 9:   end for
10: end procedure

Algorithm 5.2 Algorithm to generate negative trees

 1: procedure NonselfTreesGeneration(A, r, TA)
    Input: A set of nonself strings A ⊆ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
    Output: A set TA of ℓ prefix trees presenting the negative trees.
 2:   for i = 1, ..., ℓ do
 3:     Create an empty tree TAi
 4:   end for
 5:   for all s ∈ Ar do
 6:     for i = 1, ..., ℓ do
 7:       insert every s[i . . . i + r − 1] into TAi
 8:     end for
 9:   end for
10: end procedure

In Algorithm 5.3, Leaf(s, T) is a function that returns the frequency value associated with a string s ∈ Σ^r; this value is stored in the leaf of the path labeled by s in tree T. Counters d1 and d2 count the windows of s that are absent from the positive trees TN and the negative trees TA, respectively. Four other parameters, t1, t2, t3, t4, are also used in PSA2 to control its performance.


Algorithm 5.3 Algorithm PSA2 to detect if a new data instance s ∈ Σ^ℓ is self or nonself

 1: procedure PSA2(N, A, s, r, TN, TA)
    Input: A set of self strings N ⊆ Σ^ℓ, a set of nonself strings A ⊆ Σ^ℓ, an unlabeled string s ∈ Σ^ℓ, a matching threshold r ∈ {1, . . . , ℓ}.
    Output: A set TN of prefix trees presenting the positive trees, a set TA of prefix trees presenting the negative trees, a label of s (self or nonself).
 2:   SelfTreesGeneration(N, r, TN)
 3:   NonselfTreesGeneration(A, r, TA)
 4:   d1 = d2 = d3 = 0
 5:   Create a string sr as the ring representation of the string s
 6:   for i = 1, ..., ℓ do
 7:     s′ = sr[i . . . i + r − 1]
 8:     if s′ ∉ TNi then
 9:       d1 = d1 + 1
10:     end if
11:     if s′ ∉ TAi then
12:       d2 = d2 + 1
13:     end if
14:     if Leaf(s′, TNi) < Leaf(s′, TAi)·t1 then
15:       d3 = d3 + 1
16:     end if
17:   end for
18:   if d1 > t2 then output s is nonself
19:   else if d2 > t3 then output s is self
20:   else if d3·t4 > ℓ then output s is nonself
21:   else output s is self
22:   end if
23: end procedure

Example 5.1. Given t1 = t2 = t3 = t4 = 2, s = 00001, ℓ = 5, r = 3, and a self string set N and a nonself string set A as in Example 1.5, N = {00000, 10110} and A = {10111, 11111}. The ring form of s is sr = 0000100. Since only two substrings of sr, sr[1 . . . 3] ∈ TN1 and sr[2 . . . 4] ∈ TN2, we get d1 = 3 > t2 and d2 = 5. As a result of Algorithm 5.3, s is classified as nonself (output in line 18).
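The whole decision rule can be sketched end-to-end (Counter-based stand-ins for the prefix trees, with the helper names hypothetical; under the parameters of this example it reproduces the nonself verdict):

```python
from collections import Counter

def build_trees(data, r):
    """Counter stand-in for the prefix trees: trees[i][w] plays the role
    of Leaf(w, Ti)."""
    ell = len(next(iter(data)))
    trees = [Counter() for _ in range(ell)]
    for x in data:
        ring = x + x[:r - 1]
        for i in range(ell):
            trees[i][ring[i:i + r]] += 1
    return trees

def psa2(s, TN, TA, r, t1, t2, t3, t4):
    """Decision rule of Algorithm 5.3 over hashed trees."""
    ell = len(s)
    ring = s + s[:r - 1]                  # ring representation of s
    d1 = d2 = d3 = 0
    for i in range(ell):
        w = ring[i:i + r]
        if TN[i][w] == 0:                 # absent from positive tree i
            d1 += 1
        if TA[i][w] == 0:                 # absent from negative tree i
            d2 += 1
        if TN[i][w] < TA[i][w] * t1:      # positive evidence outweighed
            d3 += 1
    if d1 > t2:
        return "nonself"
    if d2 > t3:
        return "self"
    return "nonself" if d3 * t4 > ell else "self"
```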

5.4 Experiments

All experiments are performed on a laptop computer with Windows 8 Pro 64-bit,

Intel Core i5-3210M, CPU 2.50GHz (4 CPUs), 4GB RAM.


5.4.1 Datasets

In our experiments, we use two popular flow-based datasets: NetFlow [86] and UT [75]. Similar to previous studies [86, 49, 48], we select the same 4 features from the NetFlow dataset as the input of Experiments 1, 5, 6, and 7; 7 features from the NetFlow dataset as the input of Experiments 2 and 3; and 4 features from the UT dataset as the input of Experiment 4. Table 5.1 shows detailed information about these features. Notably, the feature Duration is the number of seconds from the start time to the end time of the flow.

Table 5.1: Features for NIDS.

Experiment   Dataset   Features
1            NetFlow   Packets, Octets, Duration, Flags
2, 3         NetFlow   Packets, Octets, Duration, Src port, Dst port, Flags, IP protocol
4            UT        Packets, Octets, Duration, Flags
5, 6, 7      NetFlow   Packets, Octets, Duration, Flags

The large number of flows in the UT dataset causes a time-consuming training phase. Therefore, we use all 5,968 unknown-traffic flows as normal data and a fraction, 10,000 flows, of the other traffic as abnormal data in Experiment 4. The ratios of training dataset to testing dataset for Experiments 5, 6, and 7 are 66.6:33.3, 1:1, and 33.3:66.6, respectively. The distribution of flows in the experiments is described in Table 5.2.

5.4.2 Data preprocessing

The preprocessing of the features of the training dataset consists of two steps. The first step is to map all features to binary string features; after this step, binary string features are constructed for both normal and anomalous data. The second step is to concatenate the binary string features of every flow; after this step, the training dataset contains equal-length binary strings. For the testing dataset, each feature is converted to a binary string using the map of the training data. If a feature value is not in the range of the corresponding training feature, its binary form is selected randomly.


Example 5.2. Given a training dataset containing numerical two-feature flows S = {N, A}, where the set of normal data N = {(1, 8), (0.5, 6)} and the abnormal data A = {(1, 9), (1, 6), (0.5, 9)}.

The ranges of the first feature and the second one are {0.5, 1} and {6, 8, 9}, respectively. Therefore, we can use one bit (two bits) to encode the first (the second) feature. A map of the numeric ranges {0.5, 1} and {6, 8, 9} can be the binary string ranges {0, 1} and {00, 01, 10}, respectively. Finally, the training dataset contains strings of length 3, S = {N, A}, where N = {101, 000} and A = {110, 100, 010}.

Assume a flow s = (2, 8) is in the testing dataset. The first feature (2 ∉ {0.5, 1}) is randomly encoded by 0 or 1, while the second feature (8 ∈ {6, 8, 9}) is encoded by 01. The binary form of s is 001 or 101.
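One possible reading of this preprocessing is the following sketch (hypothetical helper names; the rule for unseen test values is interpreted as a uniform draw over codes of the right width):

```python
import math
import random

def fit_maps(feature_columns):
    """Per feature: sort the distinct training values and assign each a
    fixed-width binary code, as in Example 5.2."""
    maps = []
    for column in feature_columns:
        values = sorted(set(column))
        width = max(1, math.ceil(math.log2(len(values))))
        maps.append({v: format(k, '0%db' % width)
                     for k, v in enumerate(values)})
    return maps

def encode(flow, maps, rng=random):
    """Concatenate per-feature codes; a value unseen in training gets a
    random code of the right width."""
    parts = []
    for v, m in zip(flow, maps):
        width = len(next(iter(m.values())))
        if v in m:
            parts.append(m[v])
        else:
            parts.append(format(rng.randrange(2 ** width), '0%db' % width))
    return ''.join(parts)
```

On the data of Example 5.2, (1, 8) encodes to 101 and (0.5, 6) to 000.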

5.4.3 Performance metrics and parameters

We used three metrics to evaluate each machine learning technique: Detection

rate (DR), Accuracy rate (ACC) and False alarm rate (FAR).
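With attacks as the positive class, the standard definitions assumed here are DR = TP/(TP+FN), ACC = (TP+TN)/(TP+TN+FP+FN) and FAR = FP/(FP+TN):

```python
def metrics(tp, tn, fp, fn):
    """Detection rate, accuracy and false alarm rate from the four
    confusion-matrix counts (attack = positive class)."""
    dr = tp / (tp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    far = fp / (fp + tn)
    return dr, acc, far
```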

We used a 10-fold cross-validation technique to evaluate our approach in Experiments 1 and 4. In Experiments 2, 3 and 5-7, hold-out validation is used to evaluate our approach on subsets of the NetFlow dataset, as described in Table 5.2.

Algorithm PSA2 requires five integer-valued parameters r, t1, t2, t3, t4, which control the detection performance of the NIDS. To obtain optimal parameter values for a given dataset, we first predict possible value ranges for them. Then we test every quintuple (r, t1, t2, t3, t4) in these ranges on a randomly sampled subset, usually one-tenth, of the testing dataset. Finally, we retest the system on the whole testing dataset in the neighbourhood of the chosen quintuple. Regarding 10-fold cross-validation, the quintuple chosen in fold 1 is reused for the other folds to reduce training time. Additionally, the value of ℓ is determined by the encoding technique in data preprocessing. The distribution of parameter values and the number of flows for all experiments are shown in Table 5.2.
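The search over quintuples described above can be sketched as an exhaustive scan; `evaluate` stands in for running PSA2 on the sampled subset and is an illustrative name, not from the thesis:

```python
from itertools import product

def best_quintuple(evaluate, ranges):
    """Scan every (r, t1, t2, t3, t4) in the given ranges.

    The objective follows the text: reach the highest DR while keeping
    FAR as low as possible (DR is maximized first, FAR breaks ties).
    """
    best_q, best_key = None, None
    for q in product(*ranges):
        dr, far = evaluate(q)     # run the detector with parameters q
        key = (dr, -far)          # maximize DR first, then minimize FAR
        if best_key is None or key > best_key:
            best_q, best_key = q, key
    return best_q

# Toy stand-in evaluator: DR grows with r, FAR grows with t1.
toy = lambda q: (q[0] / 20.0, q[1] / 100.0)
print(best_quintuple(toy, [range(8, 12), range(4, 8), [7], [8], [9]]))  # (11, 4, 7, 8, 9)
```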

It is challenging to decide which objective function is best for choosing an optimal quintuple. More specifically, the optimization problem in this research is a multi-objective one: we aim to increase the DR while reducing the FAR at the same time. We select our objective function to reach the highest DR while keeping the FAR as low as possible.

Table 5.2: Distribution of flows and parameters for experiments.

Experiment  ℓ   r   t1  t2  t3  t4  Normal flows                              Attack flows
1           49  10   5   7   8   9  105,033                                   24,538
2           63  10   7   9  14   7  59,980 for training, 45,053 for testing   5,952 for training, 18,586 for testing
3           53   9  12  16   7   8  59,980 for training, 45,053 for testing   5,952 for training, 18,586 for testing
4           41   7   1   7   2   4  5,968                                     10,000
5           49  14   3  37  11   4  70,022 for training, 35,011 for testing   16,359 for training, 8,179 for testing
6           48  12   4  19   7   5  52,516 for training, 52,517 for testing   12,269 for training, 12,269 for testing
7           46  11   2  30   8   3  35,011 for training, 70,022 for testing   8,179 for training, 16,359 for testing

5.4.4 Performance

Table 5.3 summarizes the results of the experiments. Note that some algorithms in [86] were run with different parameters; we select only the run with the highest detection rate for comparison. We compare the performance of PSA2 with algorithms from the most recently published works and with several algorithms run in WEKA 3, their parameters optimized by the meta-classifier CVParameterSelection.


Table 5.3: Comparison between PSA2 and other algorithms.

Algorithms                                     ACC     DR      FAR

Experiment 1
PSA2                                           0.9879  0.9977  0.0146
BBNN [86]                                      0.9682  0.9992  0.0514
SVM (linear) [86]                              0.9273  0.9992  0.1223
SVM (polynomial) [86]                          0.9344  0.9975  0.1071
SVM (RBF) [86]                                 0.9468  0.9974  0.0839
SVM (sigmoid) [86]                             0.9614  0.9974  0.0871
Naive Bayes [86]                               0.6777  0.9975  0.5116
Random Forest                                  0.9943  0.9976  0.0064
Random Tree                                    0.9941  0.9977  0.0067

Experiment 2
PSA2                                           0.9896  0.9676  0.0013
PSOGSA-based MLP [48]                          0.9812  0.9645  0.0119
Cuckoo-based MLP [48]                          0.9796  0.9622  0.0133
GSA-based MLP [48]                             0.9763  0.9564  0.0155
PSO-based MLP [48]                             0.9672  0.9369  0.0203
EBP-based MLP [49]                             0.9655  –       0.0315
Random Forest                                  0.9309  0.7634  0
Random Tree                                    0.7596  0.1771  0.0001

Experiment 3
PSA2 (10% labelled data)                       0.9781  0.9626  0.0157
S4VM (10% labelled, 90% unlabelled data) [49]  0.9196  –       0.0384
Random Forest (10% labelled data)              0.7157  0.0421  0.0063
Random Tree (10% labelled data)                0.7151  0.0400  0.0065

Experiment 4
PSA2                                           0.9757  0.9634  0.0035
Naive Bayes                                    0.6832  0.9972  0.8668
SVM (linear)                                   0.7315  1.0000  0.7185
SVM (polynomial)                               0.7143  0.8788  0.5613
SVM (RBF)                                      0.6263  1.0000  0.9998
SVM (sigmoid)                                  0.6263  1.0000  1.0000
Random Forest                                  0.9777  0.9675  0.0052
Random Tree                                    0.9766  0.9664  0.0062

Experiment 5
PSA2                                           0.9845  0.9979  0.0186
Random Forest                                  0.9818  0.9997  0.0868

Experiment 6
PSA2                                           0.9961  0.9915  0.0028
Random Forest                                  0.9870  0.9998  0.0636

Experiment 7
PSA2                                           0.9847  0.9979  0.0183
Random Forest                                  0.9901  0.9999  0.0490


The results of Experiment 1 show that PSA2 reaches its best performance with the values (10, 5, 7, 8, 9) for the parameter quintuple (r, t1, t2, t3, t4): a remarkable FAR of 1.46% and an ACC of 98.79%. Moreover, its DR is ranked second with 99.77%. BBNN and SVM (linear) have the highest DR, but their FARs are not comparable with that of PSA2. Both Random Forest and Random Tree perform excellently, with the highest ACC, admirable DR and very low FAR.

As in the corresponding studies [48, 49], the results of PSA2 in Experiments 2 and 3 are averages over ten repeated runs. The results of Experiment 2 show that PSA2 outperforms the four algorithms proposed in [48] in terms of all three metrics DR, ACC and FAR. PSA2 is also better than both Random Forest and Random Tree in terms of ACC and DR.

PSA2 cannot train with unlabelled data, so in Experiment 3 we use only the 10% labelled dataset, 5,998 normal flows and 595 attack flows, for the training phase. Meanwhile, the S4VM algorithm uses 100% of the training dataset, of which 90% is unlabelled and 10% labelled. The results of this experiment strongly confirm the efficiency of PSA2 in comparison with the methods proposed in [49]. Both Random Forest and Random Tree perform badly, with very low ACC and DR.

In Experiment 4, PSA2 is competitive with both Random Forest and Random Tree. Some SVMs achieve a perfect DR of 100%, but their FARs are also very high, approaching or even reaching 100%. Such poor FARs mean that these algorithms cannot recognize the benign traffic in the dataset, while PSA2, Random Forest and Random Tree can.

In Experiments 5-7, the ACCs and DRs of the two algorithms PSA2 and Random Forest are similar. However, the FARs of PSA2 are far lower than those of Random Forest.

In all experiments, the FAR of PSA2 is the lowest among the compared algorithms, except for Random Forest and Random Tree in Experiments 1-3. This result partly demonstrates the consistency of PSA2 for NIDS.

Two drawbacks of PSA2 should be noted. First, it must store a map to encode each new sample into a binary string in the detection phase. The size of the map depends on the distribution of the flow features' domains; however, in all experiments the map storage does not exceed 1% of the size of the training dataset. Second, parameter tuning is expensive: depending on the training data, it takes about 2-5 hours to choose optimal parameters in the experiments.

Another experiment compares the performance of PSA2 on ring-based and linear-based representations of the data. The first row of Table 5.4 repeats the metrics of ring string-based PSA2 from Experiment 1 (Table 5.3). The best metrics for linear string-based PSA2 under the same conditions are given in the second row; in this case, ℓ = 40 and the optimal values for the parameters r, t1, t2, t3 and t4 are 9, 4, 8, 7 and 10, respectively. The results show that ring string-based PSA2 is better than linear string-based PSA2 in terms of all three metrics ACC, DR and FAR.

Table 5.4: Comparison between ring string-based PSA2 and linear string-based PSA2.

Algorithms                                  ACC     DR      FAR
Ring string-based PSA2 (from Experiment 1)  0.9879  0.9977  0.0146
Linear string-based PSA2                    0.9833  0.9973  0.0199
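The difference between the two representations comes down to which length-r windows a string exposes: a ring string wraps around its end, so every position starts a window. A minimal sketch (illustrative only, not the thesis's implementation):

```python
def windows(s, r, ring=True):
    """Length-r windows of a binary string; the ring form wraps past the end."""
    n = len(s)
    if ring:
        return [(s + s)[i:i + r] for i in range(n)]  # n windows, wrapping around
    return [s[i:i + r] for i in range(n - r + 1)]    # n - r + 1 windows, no wrap

print(windows("1011", 3, ring=False))  # ['101', '011']
print(windows("1011", 3, ring=True))   # ['101', '011', '111', '110']
```

The ring form yields r - 1 extra windows per string, which is the additional coverage exploited by ring string-based PSA2.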

5.5 Summary

In this chapter, we have presented a new PSA, called PSA2, for two-class classification, developed through a series of works. It has four important features that make it unique and alleviate some issues in NSAs. Firstly, it uses a ring representation instead of a linear one for better performance in terms of both detection and accuracy rates. Secondly, the proposed algorithm applies PSA to both types of data, normal and abnormal samples, in a uniform framework, while other PSAs use only one type of sample; this results in good coverage of both the self space and the nonself space. Thirdly, the process of optimizing the parameters (r, t1, t2, t3, t4), as well as the method of using three frequency-related parameters d1, d2 and d3, plays an important role in improving overall performance. The fourth feature is the new method for mapping integer values into binary strings.

To verify the effectiveness of the proposed approach, two different datasets are adopted for validation. The experimental results indicate that the proposed approach produces competitive and consistent classification performance on real datasets. Moreover, the results of Experiment 3, which uses only 10% of the training dataset, confirm that PSA2 can detect anomalies with only a small amount of labelled data.

In the future, we plan to combine our algorithms with some machine learning methods to achieve better detection performance and to reduce training time. Moreover, it would be interesting to further develop techniques for choosing optimal parameters and for integrating them into new objective functions. The main contribution of this chapter has been accepted for publication in the proceedings of a National Conference on Fundamental and Applied IT Research.


CONCLUSIONS

Applying computational intelligence-based techniques is an inevitable trend in building smart IDSs. This approach helps computer networks adapt to a continuously changing environment with ever more sophisticated attacks. In this thesis, we show that AIS, a subfield of computational intelligence, is relatively successful in building NIDSs, at least at the simulation stage. A series of works is proposed and investigated to improve NSAs for two popular matching rules, r-chunk and r-contiguous.

Contributions of the thesis

The major contributions of this research are:

1. Propose a ring representation of data instead of a linear one for better performance in terms of both detection rate and accuracy rate.

2. Propose an algorithm, PNSA, that combines two selection algorithms in a uniform framework for a compact representation of data. The performance of the algorithm is guaranteed by both experimental results and theoretical proof.

3. Propose an NSA with variable-length detectors, VNSA, to generate complete and non-redundant detector sets as well as to reduce detector storage and classification time.

4. Propose an r-chunk detector-based NSA, Chunk-NSA, and prove experimentally and theoretically that it is r times faster in data training than the most recently published algorithms.

5. Propose PSA2, a hybrid algorithm that combines PSA with some statistical approaches to achieve better intrusion detection performance compared with some recently published works.


6. Propose a data conversion method to transform data into a suitable binary format.

One minor contribution of this thesis is an algorithm, Cont-NSA, based on the r-contiguous matching rule, together with a proof that its data training is approximately r times faster than that of a recently published algorithm.

Future works

In the future, we would like to:

• Combine our algorithms with some machine learning methods to achieve better detection performance.

• Further develop techniques that choose optimal parameters and integrate them into new objective functions for hybrid NIDSs.

• Improve the proposed algorithms to apply them to other data types; different data representations and matching rules are also future research directions.

• Further optimize Cont-NSA towards detection time O(ℓ) and training time O(|S|ℓ).

In a nutshell, the thesis has reviewed important works related to the research topic, proposed several improvements to selection algorithms, and verified the effectiveness of the proposed algorithms by experiments and proofs. The obtained results satisfy the given research objectives. Nevertheless, the results remain modest and should be further improved. The author welcomes comments from scientists and other readers interested in the subject, so that this work can be further refined.


Published works

A1. N. V. Truong and P. D. Lam, “Improving negative selection algorithm in artificial

immune systems for computer virus detection,” Journal of Science and Technology,

Thai Nguyen University, 72(06):53–58, 2010.

A2. N. V. Truong, V. D. Quang and T. V. Ha, “A fast r-chunk detector-based negative selection algorithm,” Journal of Science and Technology, Thai Nguyen University, 90(02):55–58, 2012.

A3. N. V. Truong and T. V. Ha, “Another look at r-chunk detector-based negative

selection algorithm,” Journal of Science and Technology, Thai Nguyen University,

102(02):45–50, 2013.

A4. N. V. Truong, N. X. Hoai, and L. C. Mai, “A Novel Combination of Negative

and Positive Selection in Artificial Immune Systems,” Vietnam National Univer-

sity, Hanoi Journal of Science: Computer Science and Communication Engineering,

31(1):22–31, 2015.

A5. N. V. Truong, P. D. Lam, and V. D. Quang, “Some Improvements of Selection

Algorithms for Spam Email Filtering,” Journal of Science and Technology, Thai

Nguyen University, 151(06):85–91, 2016.

A6. N. V. Truong and N. X. Hoai, “An improved positive selection algorithm for flow-based intrusion detection,” Proceedings of the 2nd National Conference on Fundamental and Applied IT Research (FAIR), 2019 (accepted).


Bibliography

[1] DARPA Dataset. https://www.ll.mit.edu/r-d/datasets. [accessed 20-Mar-2019].

[2] ADFA Datasets. http://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-IDS-Datasets/. [accessed 20-July-2018].

[3] KDD99 Dataset. http://kdd.ics.uci.edu/databases/kddcup99. [accessed 20-Mar-

2018].

[4] NSL-KDD Dataset. https://www.unb.ca/cic/datasets/nsl.html. [accessed 25-

July-2019].

[5] S. Afaneh, R. A. Zitar, and A. A. Hamami. Virus detection using clonal selection algorithm with genetic algorithm (VDC algorithm). Applied Soft Computing, 13:239–246, 2013.

[6] M. Ayara, J. Timmis, R. de Lemos, L. N. de Castro, and R. Duncan. Nega-

tive selection: How to generate detectors. In Proceedings of the 1st International

Conference on Artificial Immune Systems (ICARIS), pages 89–98, 2002.

[7] A. S. A. Aziz, M. Salama, A. ella Hassanien, and S. E. O. Harafi. Detectors gen-

eration using genetic algorithm for a negative selection inspired anomaly network

intrusion detection system. In Proceedings of the FedCSIS, pages 597–602, 2012.

[8] K. Bache and M. Lichman. UCI Machine Learning Repository.

http://archive.ics.uci.edu/ml. [accessed 20-July-2016].

[9] J. Balthrop, F. Esponda, S. Forrest, and M. Glickman. Coverage and generaliza-

tion in an artificial immune system. In Proceedings of Genetic and Evolutionary

Computation Conference (GECCO), pages 3–10, 2002.

[10] J. Balthrop, S. Forrest, and M. Glickman. Revisiting LISYS: Parameters and

normal behavior. In Proceedings of the Congress on evolutionary computation,

pages 1045–1050, 2002.


[11] F. Barani. A hybrid approach for dynamic intrusion detection in ad hoc networks

using genetic algorithm and artificial immune system. In Proceedings of the Iranian

Conference on Intelligent Systems (ICIS), pages 1–6, 2014.

[12] D. K. Bhattacharyya and J. K. Kalita. Network anomaly detection: A machine

learning perspective. CRC Press, 2013.

[13] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita. Network anomaly de-

tection: methods, systems and tools. IEEE communications surveys & tutorials,

16(1):303–336, 2014.

[14] R. Bronte, H. Shahriar, and H. M. Haddad. A signature-based intrusion detection

system for web applications based on genetic algorithm. In Proceedings of the

International Conference on Security of Information and Networks, pages 32–39,

2016.

[15] T. C. Butler, M. Kardar, and A. K. Chakraborty. Quorum sensing allows T cells

to discriminate between self and nonself. Proceedings of the National Academy of

Sciences, 110(29):11833–11838, 2013.

[16] C. Callegari and N. Cyprus. Statistical approaches for network anomaly detection.

In Proceedings of the 4th International Conference on Internet Monitoring and

Protection (ICIMP), pages 24–28, 2009.

[17] M. J. Chapple, T. E. Wright, and R. M. Winding. Flow anomaly detection in

firewalled networks. In Proceedings of the Securecomm and Workshops, pages 1–6,

2006.

[18] S. Chen. Optimized multilevel immune learning algorithm in abnormal detection.

Information Technology Journal, 12(3):514–517, 2013.

[19] D. Dasgupta. Artificial Immune Systems and Their Applications. Springer-Verlag,

Berlin Heidelberg, 1998.

[20] D. Dasgupta and R. Azeem. An investigation of negative authentication systems.

In Proceedings of the 3rd International Conference on Information Warfare and

Security, pages 117–126, 2008.


[21] D. Dasgupta and Y. Cao. An immunogenetic approach to spectra recognition. In

Proceedings of the Genetic and Evolutionary Computation Conference (GECCO),

pages 149–155, 1999.

[22] D. Dasgupta and S. Forrest. Novelty detection in time series data using ideas

from immunology. In Proceedings of the International Conference on Intelligent

Systems, pages 82–87, 1996.

[23] D. Dasgupta and F. Gonzalez. An immunity-based technique to characterize in-

trusions in computer networks. IEEE Transactions on Evolutionary Computation,

6:281–291, 2002.

[24] D. Dasgupta and F. Nino. A comparison of negative and positive selection algo-

rithms in novel pattern detection. In Proceedings of the International Conference

on Systems, Man, and Cybernetics, pages 125–130, 2000.

[25] D. Dasgupta and S. Saha. Password security through negative filtering. In Proceed-

ings of the International Conference on Emerging Security Technologies (EST),

pages 83–89, 2010.

[26] D. Dasgupta, S. Yu, and N. S. Majumdar. MILA-multilevel immune learning al-

gorithm. In Proceedings of the Genetic and Evolutionary Computation Conference

(GECCO), pages 183–194. Springer, 2003.

[27] L. N. de Castro and J. Timmis. Artificial Immune Systems: A New Computational Intelligence Approach. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2002.

[28] K. S. Desale and R. Ade. Genetic algorithm based feature selection approach for

effective intrusion detection system. In Proceedings of the International Conference

on Computer Communication and Informatics (ICCCI), pages 1–6, 2015.

[29] P. D’haeseleer. An immunological approach to change detection: Theoretical results. In Proceedings of the IEEE Computer Security Foundations Workshop, pages 18–26, 1996.

[30] P. D’haeseleer, S. Forrest, and P. Helman. An immunological approach to change

detection: algorithms, analysis and implications. In Proceedings of IEEE Sympo-

sium on Security and Privacy, pages 110–119, 1996.


[31] M. Elberfeld and J. Textor. Efficient algorithms for string-based negative selec-

tion. In Proceedings of the International Conference on Artificial Immune Systems,

pages 109–121, 2009.

[32] M. Elberfeld and J. Textor. Negative selection algorithms on strings with efficient training and linear-time classification. Theoretical Computer Science, 412(6):534–542, 2011.

[33] F. Esponda, E. S. Ackley, and S. Forrest. Online negative databases. In Proceedings

of the International Conference on Artificial Immune Systems (ICARIS), pages

175–188. Springer, 2004.

[34] F. Esponda, S. Forrest, and P. Helman. A formal framework for positive and neg-

ative detection schemes. IEEE Transactions on Systems, Man, and Cybernetics,

Part B (Cybernetics), 34(1):357–373, 2004.

[35] D. A. Fernandes, M. M. Freire, P. A. Fazendeiro, and P. R. Inácio. Applications of artificial immune systems to computer security: A survey. Journal of Information Security and Applications, 35:138–159, 2017.

[36] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff. A sense of self for

UNIX processes. In Proceedings of the IEEE Symposium on Research in Security

and Privacy, pages 120–128, 1996.

[37] S. Forrest, B. Javornik, R. E. Smith, and A. S. Perelson. Using genetic algorithms

to explore pattern recognition in the immune system. Evolutionary Computation,

1:191–211, 1993.

[38] S. Forrest, A. S. Perelson, L. Allen, and R. Cherukuri. Self-nonself discrimination in a computer. In Proceedings of the IEEE Symposium on Security and Privacy, pages 202–212, 1994.

[39] Z. Fuyong and Q. Deyu. Run-time malware detection based on positive selection.

Journal in Computer Virology, 7:267–277, 2011.

[40] Z. Fuyong and Q. Deyu. A positive selection algorithm for classification. Journal

Computational Information Systems, 7:207–215, 2012.

[41] A. A. Ghorbani, W. Lu, and M. Tavallaee. Network intrusion detection and pre-

vention: concepts and techniques. Springer Science & Business Media, 2009.


[42] F. González, D. Dasgupta, and J. Gómez. The effect of binary matching rules in negative selection. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 195–206, 2003.

[43] C. Guo, Y.-J. Zhou, Y. Ping, S.-S. Luo, Y.-P. Lai, and Z.-K. Zhang. Efficient

intrusion detection using representative instances. Computers & Security, 39,

Part B:255 – 267, 2013.

[44] X. Hang and H. Dai. Applying both positive and negative selection to supervised

learning for anomaly detection. In Proceedings of the Conference on Genetic and

Evolutionary Computation (GECCO), pages 345–352, 2005.

[45] P. K. Harmer, P. D. Williams, G. H. Gunsch, and G. B. Lamont. An artificial im-

mune system architecture for computer security applications. IEEE Transactions

on Evolutionary Computation, 6(3):252–280, 2002.

[46] S. Hofmeyr. An immunological model of distributed detection and its application

to computer security. PhD thesis, The University of New Mexico, Albuquerque, NM, 1999.

[47] S. B. Inadyuti Dutt and I. Maitra. Intrusion detection system using artificial

immune system. International Journal of Computer Applications, 144(12):19–22,

2016.

[48] Z. Jadidi, V. Muthukkumarasamy, and E. Sithirasenan. Metaheuristic algorithms

based flow anomaly detector. In Proceedings of the Asia-Pacific Conference on

Communications (APCC), pages 717–722, 2013.

[49] Z. Jadidi, V. Muthukkumarasamy, E. Sithirasenan, and K. Singh. Flow-based

anomaly detection using semisupervised learning. In Proceedings of the Interna-

tional Conference on Signal Processing and Communication Systems (ICSPCS),

pages 1–5, 2015.

[50] Z. Ji. Negative Selection Algorithms: from the Thymus to V-detector. PhD thesis,

The University of Memphis, 2006.

[51] Z. Ji and D. Dasgupta. Revisiting negative selection algorithms. Evolutionary

Computation, 15:223–251, 2007.


[52] L. Jim and M. Gregory. A review of artificial immune system based security

frameworks for MANET. International Journal of Communications, Network and

System Sciences, 9(1):1–18, 2016.

[53] J. Kim. Integrating Artificial Immune Algorithms for Intrusion Detection. PhD thesis, University College London, 2002.

[54] J. Kim and P. J. Bentley. An Evaluation of Negative Selection in an Artificial

Immune System for Network Intrusion Detection. In Proceedings of the Genetic

and Evolutionary Computation Conference (GECCO), pages 1330–1337, 2001.

[55] J. Kim, P. J. Bentley, U. Aickelin, J. Greensmith, G. Tedesco, and J. Twycross.

Immune system approaches to intrusion detection – a review. Natural Computing,

6(4):413–466, 2007.

[56] A. Košmrlj, A. K. Jha, E. S. Huseby, M. Kardar, and A. K. Chakraborty. How

the thymus designs antigen-specific and self-tolerant T cell receptor sequences.

Proceedings of the National Academy of Sciences, 105(43):16671–16676, 2008.

[57] A. Košmrlj, E. L. Read, Y. Qi, T. M. Allen, M. Altfeld, S. G. Deeks, F. Pereyra,

M. Carrington, B. D. Walker, and A. K. Chakraborty. Effects of thymic selection

of the T-cell repertoire on HLA class I-associated control of HIV infection. Nature,

465(7296):350–354, 2010.

[58] V. D. Kotov and V. Vasilyev. Immune model based approach for network in-

trusion detection. In Proceedings of the International Conference on Security of

Information and Networks (SIN), pages 233–237, 2010.

[59] W. Ma, D. Tran, and D. Sharma. Negative selection with antigen feedback in

intrusion detection. In Proceedings of the International Conference on Artificial

Immune Systems (ICARIS), pages 200–209, 2008.

[60] M. V. Mahoney and P. K. Chan. An analysis of the 1999 DARPA/Lincoln Lab-

oratory evaluation data for network anomaly detection. In Proceedings of the In-

ternational Workshop on Recent Advances in Intrusion Detection, pages 220–237,

2003.

[61] C. A. Martínez, G. I. Echeverri, and A. G. C. Sanz. Malware detection based on

cloud computing integrating intrusion ontology representation. In Proceedings of

the IEEE Latin-American Conference on Communications, pages 1–6, 2010.


[62] J. McHugh. Testing intrusion detection systems: A critique of the 1998 and 1999

DARPA intrusion detection system evaluations as performed by Lincoln Labo-

ratory. ACM Transactions on Information and System Security, 3(4):262–294,

2000.

[63] T. Mehmod and H. B. M. Rais. Ant colony optimization and feature selection

for intrusion detection. In Advances in Machine Learning and Signal Processing,

pages 305–312. Springer, 2016.

[64] R. Murugesan and V. N. Kumar. A Fast Algorithm for Solving JSSP. European

Journal of Scientific Research, 64:579–586, 2011.

[65] P. Ning and S. Jajodia. Intrusion detection techniques. The Internet Encyclopedia,

2003.

[66] L. X. Peng and Y. F. Chen. Positive selection-inspired anomaly detection model

with artificial immune. In Proceedings of the International Conference on Cyber-

Enabled Distributed Computing and Knowledge Discovery (CyberC), pages 56–59,

2014.

[67] P. H. Pisani, A. C. Lorena, and A. C. Carvalho. Adaptive positive selection for

keystroke dynamics. J. Intell. Robotics Syst., 80(1):277–293, 2015.

[68] D. Poole, A. Mackworth, and R. Goebel. Computational Intelligence: A Logical

Approach. Oxford University Press, Oxford, UK, 1997.

[69] Y. Sawaya, A. Kubota, and Y. Miyake. Detection of attackers in services using

anomalous host behavior based on traffic flow statistics. In Proceedings of the

International Symposium on Applications and the Internet (SAINT), pages 353–

359, 2011.

[70] M. Sheikhan and Z. Jadidi. Flow-based anomaly detection in high-speed links us-

ing modified GSA-optimized neural network. Neural Computing and Applications,

24(3-4):599–611, 2014.

[71] G. C. Silva and D. Dasgupta. A Survey of Recent Works in Artificial Immune

Systems, pages 547–586. World Scientific, 2016.


[72] G. C. Silva, R. M. Palhares, and W. M. Caminhas. Immune inspired fault detection

and diagnosis: A fuzzy-based approach of the negative selection algorithm and

participatory clustering. Expert Systems with Applications, 39:12474–12486, 2012.

[73] K. B. Sim and D. W. Lee. Modeling of Positive Selection for the Development

of a Computer Immune System and a Self-Recognition Algorithm. International

Journal of Control, Automation, and Systems, 1:453–458, 2003.

[74] T. S. Sobh and W. M. Mostafa. A cooperative immunological approach for de-

tecting network anomaly. Applied Soft Computing, 11(1):1275 – 1283, 2011.

[75] A. Sperotto, R. Sadre, F. Vliet, and A. Pras. A labeled data set for flow-based

intrusion detection. In Proceedings of the IEEE International Workshop on IP

Operations and Management, pages 39–50, 2009.

[76] A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller. An

Overview of IP Flow-Based Intrusion Detection. IEEE Communications Surveys

Tutorials, 12(3):343–356, 2010.

[77] T. Stibor. On the appropriateness of negative selection for anomaly detection and

network intrusion detection. PhD thesis, TU Darmstadt, 2006.

[78] T. Stibor, K. M. Bayarou, and C. Eckert. An investigation of R-chunk detector

generation on higher alphabets. In Proceedings of the Genetic and Evolutionary

Computation Conference (GECCO), pages 299–307, 2004.

[79] T. Stibor, P. Mohr, J. Timmis, and C. Eckert. Is negative selection appropriate for

anomaly detection? In Proceedings of the Genetic and Evolutionary Computation

Conference (GECCO), pages 321–328, 2005.

[80] T. Stibor, J. Timmis, and C. Eckert. A comparative study of real-valued negative

selection to statistical anomaly detection techniques. Lecture notes in Computer

science, 3627:262–275, 2005.

[81] Y. Tan. Anti-Spam Techniques Based on Artificial Immune System. CRC Press,

2016.

[82] J. Textor. Efficient negative selection algorithms by sampling and approximate

counting. In Proceedings of the International Conference on Parallel Problem

Solving from Nature, pages 32–41. Springer Berlin Heidelberg, 2012.


[83] J. Textor. Search and learning in the immune system: models of immune surveillance and negative selection. PhD thesis, University of Lübeck, 2012.

[84] J. Textor, K. Dannenberg, and M. Liśkiewicz. A generic finite automata based

approach to implementing lymphocyte repertoire models. In Proceedings of the

Conference on Genetic and Evolutionary Computation (GECCO), pages 129–136,

2014.

[85] J. Timmis, A. Hone, T. Stibor, and E. Clark. Theoretical advances in artificial

immune systems. Theoretical Computer Science, 403(1):11–32, 2008.

[86] Q. A. Tran, F. Jiang, and J. Hu. A real-time netFlow-based intrusion detec-

tion system with improved BBNN and high-frequency field programmable gate

arrays. In Proceedings of the IEEE International Conference on Trust, Security

and Privacy in Computing and Communications, pages 201–208, 2012.

[87] J. Vykopal. Flow-based Brute-force Attack Detection in Large and High-speed

Networks. Dissertations, Masaryk University, 2013.

[88] Y. Wang. Statistical Techniques for Network Security: Modern Statistically-Based

Intrusion Detection and Protection. IGI Global, 2008.

[89] S. T. Wierzchoń. Discriminative power of the receptors activated by k-contiguous

bits rule. Journal of Computer Science and Technology, 1(3):1–13, 2000.

[90] S. T. Wierzchoń. Generating optimal repertoire of antibody strings in an artifi-

cial immune system. In Proceedings of the Symposium on Intelligent Information

Systems, pages 119–133, 2000.

[91] P. Winter, E. Hermann, and M. Zeilinger. Inductive intrusion detection in flow-

based network data using one-class support vector machines. In Proceedings of the

International Conference on New Technologies, Mobility and Security (NTMS),

pages 1–5, 2011.

[92] S. X. Wu and W. Banzhaf. The use of computational intelligence in intrusion

detection systems: A review. Applied Soft Computing, 10(1):1–35, 2010.

[93] B. Xu, W. Luo, X. Pei, M. Zhang, and X. Wang. On average time complexity of

evolutionary negative selection algorithms for anomaly detection. In Proceedings


of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation,

pages 631–638. ACM, 2009.

[94] H. Yang, T. Li, X. Hu, F. Wang, and Y. Zou. A survey of artificial immune system

based intrusion detection. The Scientific World Journal, 2014.

[95] X. Yang and Z. Hui. Intrusion detection alarm filtering technology based on ant

colony clustering algorithm. In Proceedings of the International Conference on

Intelligent Systems Design and Engineering Applications (ISDEA), pages 470–

473, 2015.

[96] D. Y. Yeung and Y. Ding. Host-based intrusion detection using dynamic and

static behavioral models. Pattern Recognition, 36:229–243, 2003.

[97] Zeng, Jie and Liu, Xiaojie and Li, Tao and Li, Guiyang and Li, Haibo and Zeng,

Jinquan. A novel intrusion detection approach learned from the change of antibody

concentration in biological immune response. Applied Intelligence, 35(1):41–62,

2011.

[98] D. Zhao and W. Luo. Real-valued negative databases. In Proceedings of the

European Conference on Artificial Life (ECAL), pages 884–890, 2013.

[99] M. Zolotukhin, T. Hämäläinen, T. Kokkonen, and J. Siltanen. Online detection of

anomalous network flows with soft clustering. In Proceedings of the International

Conference on New Technologies, Mobility and Security (NTMS), pages 1–5, 2015.