The thesis for the degree of doctor of Philosophy in mathematics: Improving some artificial immune algorithms for network intrusion detection

Chia sẻ: _ _ | Ngày: | Loại File: PDF | Số trang:103

Thêm vào BST

Báo xấu

19
lượt xem 4
download

Download Vui lòng tải xuống để xem tài liệu đầy đủ

Since data representation is one of the factors that affect the training and testing time, a compact and complete detector generation algorithm is investigated; the thesis investigates optimal algorithms to generate detector set in AIS; they help to reduce both training time and detecting time of AIS-based IDSs.

Chủ đề:

Bình luận(0) Đăng nhập để gửi bình luận!

Lưu

Nội dung Text: The thesis for the degree of doctor of Philosophy in mathematics: Improving some artificial immune algorithms for network intrusion detection

MINISTRY OF EDUCATION VIETNAMESE ACADEMY AND TRAINING OF SCIENCE AND TECHNOLOGY GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY ———————————— NGUYEN VAN TRUONG IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS FOR NETWORK INTRUSION DETECTION THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN MATHEMATICS Hanoi - 2019
MINISTRY OF EDUCATION VIETNAMESE ACADEMY AND TRAINING OF SCIENCE AND TECHNOLOGY GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY ———————————— NGUYEN VAN TRUONG IMPROVING SOME ARTIFICIAL IMMUNE ALGORITHMS FOR NETWORK INTRUSION DETECTION THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN MATHEMATICS Major: Mathematical foundations for Informatics Code: 62 46 01 10 Scientific supervisor: 1. Assoc. Prof., Dr. Nguyen Xuan Hoai 2. Assoc. Prof., Dr. Luong Chi Mai Hanoi - 2019
Acknowledgments First of all I would like to thank is my principal supervisor, Assoc. Prof., Dr. Nguyen Xuan Hoai for introducing me to the field of Artificial Immune System. He guides me step by step through research activities such as seminar presentations, paper writing, etc. His genius has been a constant source of help. I am intrigued by his constructive criticism throughout my PhD. journey. I wish also to thank my co-supervisor, Assoc. Prof., Dr. Luong Chi Mai. She is always very enthusiastic in our discussion promising research questions. It is a pleasure and luxury for me to work with her. This thesis could not have been possible without my supervisors’ support. I gratefully acknowledge the support from Institute of Information Technology, Vietnamese Academy of Science and Technology, and from Thai Nguyen University of Education. I thank the financial support from the National Foundation for Science and Technology Development (NAFOSTED), ASEAN-European Academic University Network (ASEA-UNINET). I thank M.Sc. Vu Duc Quang, M.Sc. Trinh Van Ha and M.Sc. Pham Dinh Lam, my co-authors of published papers. I thank Assoc. Prof., Dr. Tran Quang Anh and Dr. Nguyen Quang Uy for many helpful insights for my research. I thank colleagues, especially my cool labmate Mr. Nguyen Tran Dinh Long, in IT Research & Development Center, HaNoi University. Finally, I thank my family for their endless love and steady support.
Certificate of Originality I hereby declare that this submission is my own work under my scientific super- visors, Assoc. Prof., Dr. Nguyen Xuan Hoai, and Assoc. Prof., Dr. Luong Chi Mai. I declare that, it contains no material previously published or written by another person, except where due reference is made in the text of the thesis. In addition, I certify that all my co-authors allow me to present our work in this thesis. Hanoi, 2019 PhD. student Nguyen Van Truong
i Contents List of Figures v List of Tables vii Notation and Abbreviation viii INTRODUCTION 1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Problem statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1 BACKGROUND 5 1.1 Detection of Network Anomalies . . . . . . . . . . . . . . . . . . . . . . 5 1.1.1 Host-Based IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.2 Network-Based IDS . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.1.4 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2 A brief overview of human immune system . . . . . . . . . . . . . . . . 8 1.3 AIS for IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3.1 AIS model for IDS . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3.2 AIS features for IDS . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4 Selection algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4.1 Negative Selection Algorithms . . . . . . . . . . . . . . . . . . . 12
ii 1.4.2 Positive Selection Algorithms . . . . . . . . . . . . . . . . . . . 15 1.5 Basic terms and definitions . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.5.1 Strings, substrings and languages . . . . . . . . . . . . . . . . . 16 1.5.2 Prefix trees, prefix DAGs and automata . . . . . . . . . . . . . 17 1.5.3 Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.5.4 Detection in r-chunk detector-based positive selection . . . . . . 20 1.5.5 Holes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 1.5.6 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . 22 1.5.7 Ring representation of data . . . . . . . . . . . . . . . . . . . . 23 1.5.8 Frequency trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.6 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.6.1 The DARPA-Lincoln datasets . . . . . . . . . . . . . . . . . . . 27 1.6.2 UT dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.6.3 Netflow dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 1.6.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2 COMBINATION OF NEGATIVE SELECTION AND POSITIVE SE- LECTION 30 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3 New Positive-Negative Selection Algorithm . . . . . . . . . . . . . . . . 31 2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3 GENERATION OF COMPACT DETECTOR SET 43 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3 New negative selection algorithm . . . . . . . . . . . . . . . . . . . . . 45
iii 3.3.1 Detectors set generation under rcbvl matching rule . . . . . . . 45 3.3.2 Detection under rcbvl matching rule . . . . . . . . . . . . . . . 48 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4 FAST SELECTION ALGORITHMS 51 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3 A fast negative selection algorithm based on r-chunk detector . . . . . . 52 4.4 A fast negative selection algorithm based on r-contiguous detector . . . 57 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5 APPLYING HYBRID ARTIFICIAL IMMUNE SYSTEM FOR NET- WORK SECURITY 66 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.3 Hybrid positive selection algorithm with chunk detectors . . . . . . . . 69 5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.3 Performance metrics and parameters . . . . . . . . . . . . . . . 72 5.4.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 CONCLUSIONS 78 Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Published works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
iv BIBLIOGRAPHY 81
v List of Figures 1.1 Classification of anomaly-based intrusion detection methods . . . . . . 7 1.2 Multi-layered protection and elimination architecture . . . . . . . . . . 9 1.3 Multi-layer AIS model for IDS . . . . . . . . . . . . . . . . . . . . . . . 10 1.4 Outline of a typical negative selection algorithm. . . . . . . . . . . . . . 13 1.5 Outline of a typical positive selection algorithm. . . . . . . . . . . . . . 15 1.6 Example of a prefix tree and a prefix DAG. . . . . . . . . . . . . . . . . 18 1.7 Existence of holes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.8 Negative selections with 3-chunk and 3-contiguous detectors. . . . . . . 23 1.9 A simple ring-based representation (b) of a string (a). . . . . . . . . . . 25 1.10 Frequency trees for all 3-chunk detectors. . . . . . . . . . . . . . . . . . 26 2.1 Binary tree representation of the detectors set generated from S. . . . . 33 2.2 Conversion of a positive tree to a negative one. . . . . . . . . . . . . . . 33 2.3 Diagram of the Detector Generation Algorithm. . . . . . . . . . . . . . 35 2.4 Diagram of the Positive-Negative Selection Algorithm. . . . . . . . . . 37 2.5 One node is reduced in a tree: a compact positive tree has 4 nodes (a) and its conversion (a negative tree) has 3 node (b). . . . . . . . . . . . 38 2.6 Detection time of NSA and PNSA. . . . . . . . . . . . . . . . . . . . . 40 2.7 Nodes reduction on trees created by PNSA on Netflow dataset. . . . . . 41 2.8 Comparison of nodes reduction on Spambase dataset. . . . . . . . . . . 41 3.1 Diagram of a algorithm to generate perfect rcbvl detectors set. . . . . . 47 4.1 Diagram of the algorithm to generate positive r-chunk detectors set. . . 55
vi 4.2 A prefix DAG G and an automaton M . . . . . . . . . . . . . . . . . . 57 4.3 Diagram of the algorithm to generate negative r-contiguous detectors set. 61 4.4 An automaton represents 3-contiguous detectors set. . . . . . . . . . . . 62 4.5 Comparison of ratios of runtime of r-chunk detector-based NSA to run- time of Chunk-NSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.6 Comparison of ratios of runtime of r-contiguous detector-based NSA to runtime of Cont-NSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
vii List of Tables 1.1 Performance comparison of NSAs on linear strings and ring strings. . . 24 2.1 Comparison of memory and detection time reductions. . . . . . . . . . 39 2.2 Comparison of nodes generation on Netflow dataset. . . . . . . . . . . . 40 3.1 Data and parameters distribution for experiments and results comparison. 49 4.1 Comparison of our results with the runtimes of previously published algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 Comparison of Chunk-NSA with r-chunk detector-based NSA. . . . . . 63 4.3 Comparison of proposed Cont-NSA with r-contiguous detector-based NSA. 64 5.1 Features for NIDS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2 Distribution of flows and parameters for experiments. . . . . . . . . . . 73 5.3 Comparison between PSA2 and other algorithms. . . . . . . . . . . . . 74 5.4 Comparison between ring string-based PSA2 and linear string-based PSA2. 76
viii Notation and Abbreviation Notation ` Length of data samples Sr Set of ring presentations of all strings in S |X| Cardinality of set X Σ An alphabet, a nonempty and finite set of symbols Σk Set of all strings of length k on alphabet Σ, where k is a positive integer. Σ∗ Set of all strings on alphabet Σ, including an empty string. r Matching threshold Dpi Set of all positive r-chunk detectors at position i. Dni Set of all negative r-chunk detectors at position i. CHUNKp (S, r) Set of all positive r-chunk detectors. CHUNK(S, r) Set of all negative r-chunk detectors. CONT(S, r) Set of all r-contiguous detectors. L(X) Set of all nonself strings detected by X. rcbvl r-contiguous bit with variable length. Abbreviation AIS Artificial Immune System ACC Accuracy Rate ACO Ant Colony Optimization ANIDS Anomaly Network Intrusion Detection System BBNN Block-Based Neural Network Chunk-NSA Chunk Detector-Based Negative Selection Algorithm Cont-NSA Contiguous Detector-Based Negative Selection Algorithm DR Detection Rate DAG Directed Acyclic Graph FAR False Alarm Rate GA Genetic Algorithm HIS Human Immune System HIDS Host Intrusion Detection System IDS Intrusion Detection System
ix ML Machine Learning MLP Multilayer Perceptron NIDS Network Intrusion Detection System NS Negative Selection NSA Negative Selection Algorithm NSM Negative Selection Mutation PNSA Positive-Negative Selection Algorithm PSA Positive Selection Algorithm PSA2 Two-class Positive Selection Algorithm PSO Particle Swarm Optimization PSOGSA Particle Swarm Optimization-Gravitational Search Algorithm RNSA Real-valued NSA SVM Support Vector Machines TCP Transmission Control Protocol VNSA Variable length detector-based NSA
1 INTRODUCTION Motivation Internet users and computer networks are suffering from rapidly increasing num- ber of attacks. In order to keep them safe, there is a need for effective security monitor- ing systems, such as Intrusion Detection Systems (IDS). However, intrusion detection has to face a number of different problems such as large network traffic volumes, im- balanced data distribution, difficulties to realize decision boundaries between normal and abnormal actions, and a requirement for continuous adaptation to a constantly changing environment. As a result, many researchers have attempted to use different types of approaches to build reliable intrusion detection system. Computational intelligence techniques, known for their ability to adapt and to exhibit fault tolerance, high computational speed and resilience against noisy informa- tion, are hopefully alternative methods to the problem. One of the promising computational intelligence methods for intrusion detection that have emerged recently are artificial immune systems (AIS) inspired by the biolog- ical immune system. Negative selection algorithm (NSA), a dominating model of AIS, is widely used for intrusion detection systems (IDS) [55, 52]. Despite its successful application, NSA has some weaknesses: 1-High false positive rate (false alarm rate) and false negative rate, 2-High training and testing time, 3-Exponential relationship between the size of the training data and the number of detectors possibly generated for testing, 4-Changeable definitions of ”normal data” and ”abnormal data” in dynamic network environment [55, 79, 92]. To overcome these limitations, trends of recent works are to concentrate on complex structures of immune detectors, matching methods and hybrid NSAs [11, 94, 52]. Following trends mentioned above, in this thesis we investigate the ability of NSA to combine with other classification methods and propose more effective data
2 representations to improve some NSA’s weaknesses. Scientific meaning of the thesis: to provide further background to improve per- formance of AIS-based computer security field in particular and IDS in general. Reality meaning of the thesis: to assist computer security practicers or experts implement their IDS with new features from AIS origin. The major contributions of this research are: Propose a new representation of data for better performance of IDS; Propose a combination of existing algorithms as well as some statistical approaches in an uniform framework; Propose a complete and non-redundant detector representation to archive optimal time and memory complex- ities. Objectives Since data representation is one of the factors that affect the training and testing time, a compact and complete detector generation algorithm is investigated. The thesis investigates optimal algorithms to generate detector set in AIS. They help to reduce both training time and detecting time of AIS-based IDSs. Also, it is regarded to propose and investigate an AIS-based IDS that can promptly detect attacks, either if they are known or never seen before. The proposed system makes use of AIS with statistics as analysis methods and flow-based network traffic as experimental data. Problem statements Since the NSA has some limitations as listed in the first section, this thesis concentrates on three problems: 1. The first problem is to find compact representations of data. Objectives of this problem’s solution is not only to minimize memory storage but also to reduce testing time. 2. The second problem is to propose algorithms that can reduce training time and testing time in compared with all existing related algorithms.
3 3. The third problem is to improve detection performance with respect to reduc- ing false alarm rates while keeping detection rate and accuracy rate as high as possible. Solutions of these problems can partly improve first three weaknesses as listed in the first section. Regarding to the last NSAs’ weakness about changeable definitions of ”normal data” and ”abnormal data” in dynamic network environment, we consider it as a risk in our proposed algorithm and left it for future work. Logically, it is impossible to find an optimal algorithm that can both reduce time and memory complexities and obtain best detection performance. These aspects are always in conflict with each other. Thus, in each chapter, we will propose algorithms to solve each problem quite independently. The intrusion detection problem mentioned in this thesis can be informally stated as: Given a finite set S of network flows which labeled with self (normal) or nonself (abnormal). The objective is to build classifying models on S that can label exactly an unlabeled network flow s. Outline of thesis The first chapter introduces the background knowledge necessary to discuss the algorithms proposed in following chapters. First, detection of network anomalies is briefly introduced. Following that, the human immune system, artificial immune sys- tem, machine learning and their relevance are reviewed and discussed. Then, popular datasets used for experiments in the thesis are examined. related works. In Chapter 2, a combination method of selection algorithms is presented. The proposed technique helps to reduce detectors storage generated in training phase. Test- ing time, an important measurement in IDS, will also be reduced as a direct consequence of a smaller memory complexity. Tree structure is used in this chapter (and in Chapter 5) to improve time and memory complexities. A complete and nonredundant detector set, also called perfect detectors set,
4 is necessary to archive acceptable self and nonself coverage of classifiers. A selection algorithm to generate a perfect detectors set is investigated in Chapter 3. Each detector in the set is a string concatenated from overlapping classical ones. Different from approaches in the other chapters, discrete structure of string-based detectors in this chapter are suitable for detection in distributed environment. Chapter 4 includes two selection algorithms for fast training phase. The optimal algorithms can generate a detectors set in linear time with respect to size of training data. The experiment results and theoretical proof show that proposed algorithms outperform all existing ones in term of training time. In term of detection time, the first algorithm and the second one is linear and polynomial, respectively. Chapter 5 mainly introduces a hybrid approach of positive selection algorithm with statistics for more effective NIDS. Frequencies of self and nonself data (strings) are contained in leaves of trees representing detectors. This information plays an important role in improving performance of the proposed algorithms. The hybrid approach came as a new positive selection algorithm for two-class classification that can be trained with samples from both self and nonself data types.
5 Chapter 1 BACKGROUND The human immune system (HIS) has successfully protected our bodies against attacks from various harmful pathogens, such as bacteria, viruses, and parasites. It distinguishes pathogens from self-tissue, and further eliminates these pathogens. This provides a rich source of inspiration for computer security systems, especially intrusion detection systems [92]. Hence, applying theoretical immunology and observed immune functions, its principles, and its models to IDS has gradually developed into a new research field, called artificial immune system (AIS). How to apply remarkable features of HIS to archive scalable and robust IDS is considered a researching gap in the field of computer security. In this chapter, we introduce the background knowledge necessary to discuss the algorithms proposed in following chapters that can partly fulfill the gap. Firstly, a brief introduction to network anomaly detection is presented. We then overview HIS. Next, immune selection algorithms, detectors, performance metrics and their relevance are reviewed and discussed. Finally, some popular datasets are examined. 1.1 Detection of Network Anomalies The idea of intrusion detection is predicated on the belief that an intruder’s behavior is noticeably different from that of a legitimate user and that many unautho- rized actions are detectable [65]. Intrusion detection systems (IDSs) are deployed as a second line of defense along with other preventive security mechanisms, such as user
6 authentication and access control. Based on its deployment, an IDS can act either as a host-based or as a network-based IDS. 1.1.1 Host-Based IDS A Host-Based IDS (HIDS) monitors and analyzes the internals of a computing system. A HIDS may detect internal activity such as which program accesses what resources and attempts illegitimate access, for example, an activity that modifies the system password database. Similarly, a HIDS may look at the state of a system and its stored information whether it is in RAM or in the file system or in log files or elsewhere. Thus, one can think of a HIDS as an agent that monitors whether anything or anyone internal or external has circumvented the security policy that the operating system tries to enforce [12]. 1.1.2 Network-Based IDS A Network-Based IDS (NIDS) detects intrusions in network data. Intrusions typically occur as anomalous patterns. Most techniques model the data in a sequential fashion and detect anomalous subsequences. The primary reason for these anomalies is the attacks launched by outside attackers who want to gain unauthorized access to the network to steal information or to disrupt the network. In a typical setting, a network is connected to the rest of the world through the Internet. The NIDS reads all incoming packets or flows, trying to find suspicious patterns. For example, if a large number of TCP connection requests to a very large number of different ports are observed within a short time, one could assume that there is someone committing a port scan at some of the computers in the network. Port scans mostly try to detect incoming shell codes in the same manner that an ordinary intrusion detection system does. In addition to inspecting the incoming traffic, a NIDS also provides valuable information about intrusion from outgoing or local traffic. Some attacks might even be staged from the inside of a monitored network or network segment; and therefore, not regarded as incoming traffic at all. The data available for intrusion detection systems can be at different levels of granularity, like packet level traces or Cisco netflow data.
7 The data is high dimensional, typically, with a mix of categorical as well as continuous numeric attributes. Misuse-based NIDSs attempt to search for known intrusive patterns while an anomaly-based intrusion detector searches for unusual patterns. Today, the intrusion detection research is mostly concentrated on anomaly-based network intrusion detection because it can detect both known and unknown attacks [12]. 1.1.3 Methods On the basis of the availability of prior knowledge, the detection mechanism used, the mode of performance and the ability to detect attacks, existing anomaly detection methods are categorized into six broad categories [41] as shown in Fig. 1.1. This figure is adapted from [12]. Supervised Parametric Learning Clustering Non-Parametric Unsupervised Learning Association Mining Outlier mining Probabilistic Learning ANN based Anomaly Rough Set based Detection Fuzzy Logic Soft GA based & Ant Colony Computing Artificial Immune System Rule based & Expert Knowledge System based based Ontology & Logic based Ensemble based Fusion based Combination Learners Hybrid Figure 1.1: Classification of anomaly-based intrusion detection methods AIS is a fairly new research subfield of Computational intelligence. It was considered as a system that acts intelligently: What it does is appropriate for its circumstances and its goal; it is flexible to changing environments and changing goals; it learns from experience; also it makes appropriate choices given perceptual limitations and finite computation [68].