Summary of Computer doctoral thesis: Researching and proposing PSI graph as a feature for Botnet detection on IoT devices
This thesis aims to propose a feature with a novel, efficient and low-complexity graph structure for detecting multi-architecture IoT botnets with high accuracy.
- RESEARCHING AND PROPOSING PSI GRAPH AS A FEATURE FOR BOTNET DETECTION ON IOT DEVICES
TABLE OF CONTENTS

INTRODUCTION
1. The urgency of this thesis
2. Research aim
3. Research object and area
4. Research outlines and methodology
5. Thesis layout

CHAPTER 1: THEORETICAL BASIS
1.1. Definition and characteristics of IoT devices
1.2. Definition of IoT botnet
1.3. The evolution of IoT botnet
1.4. Comparison between traditional botnet and IoT botnet

CHAPTER 2. IOT BOTNET MALWARE DETECTION METHOD
2.1. Comparison of static and dynamic analysis
2.2. Evaluation of IoT botnet detection methods based on static analysis
2.2.1. Constructing the experimental dataset
2.2.2. Experimental results and discussions

CHAPTER 3. PSI GRAPH FEATURE FOR DETECTION OF IOT BOTNET
3.1. Statement of the problem
3.2. Explanation of the problem
3.3. Proposed method
3.4. Function call graph in IoT botnet malware detection
3.5. PSI Graph construction
3.6. Experimental evaluation
3.6.1. Experimental environment
3.6.2. Evaluation model
3.6.3. Experimental results and discussion

CHAPTER 4. PSI-ROOTED SUBGRAPH FEATURE IN DETECTING IOT BOTNET
4.1. Statement of the problem
4.2. Building PSI-rooted subgraph feature
4.3. Experiment and evaluate the results
4.3.1. Experimental environment
4.3.2. Evaluation model
4.3.3. Experimental results and discussion

CONCLUSIONS
INTRODUCTION

1. The urgency of this thesis

The Industry 4.0 revolution, which is closely associated with the Internet of Things (IoT) and the Industrial Internet, has a great impact on the industry of every nation. Whatever name is used, its most significant characteristic is the replacement of traditional production machines with fully automated machines built on top of IoT devices. By applying the cutting-edge technology of Industry 4.0, people are able to take major leaps in almost every field, such as medicine, education and economics. Although Industry 4.0 provides undeniable benefits, it also poses many cyber security threats that may directly harm national security and regional stability. A recent survey of articles published by Elsevier, IEEE, Hindawi and Springer [6] suggested that authentication has been the most common solution for securing IoT devices, while research on trust management, lightweight cryptography and secure communication between IoT devices has been gaining popularity. Furthermore, botnets are among the most dangerous threats to IoT devices. Therefore, to meet the urgent real-world demand for securing IoT devices, this thesis focuses on researching and proposing the PSI graph, which can be used as a feature for botnet detection on IoT devices.

2. Research aim

Based on the needs described above, this thesis aims to propose a feature with a novel, efficient and low-complexity graph structure for detecting multi-architecture IoT botnets with high accuracy.

3. Research object and area

- Research object: multi-architecture binary executables on IoT devices that run Linux kernel 2.6 or 3.2.
- Research area: the thesis formulates malware detection as a binary classification problem under the following constraint: only static analysis methods are studied, for IoT botnet detection on resource-constrained IoT devices (SOHO devices), that is, devices with low power consumption or with small memory and limited computing capability.

4. Research outlines and methodology

*) Research outlines: the thesis focuses on analyzing and evaluating the following contents:
- Researching the development, evolution and characteristics of IoT botnets and of IoT botnet detection methods.
- Surveying, analyzing and evaluating existing IoT botnet detection methods based on static analysis, on the same dataset and in the same environment.
- Researching and proposing a new graph-based feature that can be applied in the IoT botnet detection process.
- Evaluating the accuracy and complexity of the proposed feature in IoT botnet detection on reliable datasets, and comparing the experimental results with other proposals that follow the same approach.

*) Research methodology: combining theoretical research with practical research.
- Theoretical research: surveying, summarizing and evaluating related work at national and international level in order to analyze the remaining problems that the proposed method can solve. Published articles were collected from reputable sources such as Google Scholar, ScienceDirect, ACM Digital Library and IEEE Xplore, and from industry conferences such as Black Hat, USENIX and DEF CON. In particular, the theoretical research focuses on the behavioral characteristics and infection life cycle of botnet malware, and on decompiled code fragments of samples executed on IoT devices.
- Practical research: based on a dataset of more than 10,000 samples, including botnet malware and benign samples from IoT devices, divided into training and testing sets at a ratio of 70:30 and using cross-validation techniques, the thesis conducted experiments to construct a feature for IoT botnet detection on a real-world dataset, to evaluate the effectiveness of the proposed PSI graph feature with deep learning, and to evaluate the effectiveness of the improved feature, the PSI-rooted subgraph, with machine learning algorithms.

5. Thesis layout

The thesis consists of an introduction, four chapters, and conclusions and recommendations. The appendix has 126 pages of illustrations, with 17 tables, 59 figures and graphs, and 123 references.

- Introduction: practical urgency and structure of this thesis
- Chapter 1: Theoretical basis
- Chapter 2: IoT botnet detection methods
- Chapter 3: PSI graph as a feature in IoT botnet detection
- Chapter 4: PSI-rooted subgraph as a feature in IoT botnet detection
- Conclusions and recommendations
- Appendix

CHAPTER 1: THEORETICAL BASIS

1.1. Definition and characteristics of IoT devices

The term IoT (Internet of Things) was first defined by Kevin Ashton, the founding scientist of Auto-ID at MIT. Since then, various definitions of IoT have been proposed, without a unified one.
However, all existing definitions focus on the connection between things (devices) via the Internet. Therefore, this thesis summarizes the definition of IoT as "a platform that consists of physical and logical things that can be integrated with applications, humans and environments, and that can connect, transmit and process data for different purposes".

A recent Statista survey suggested that the number of IoT devices will increase dramatically, up to 75 billion devices in 2025, which is 2.4 times as many as in 2020. Furthermore, IoT devices are present in every field, such as medical systems, production management systems and energy management systems. Within its research area, this thesis defines IoT devices as "physical and logical multi-architecture devices that have restricted computing resources and capabilities but can connect, transmit and process data for a specified purpose".

In general, most existing IoT devices run one of the various UNIX-based operating system distributions, whose popularity comes from their useful set of utilities. Therefore, this thesis only focuses on Linux executables in the common ELF (Executable and Linkable Format) format.
Compared with devices based on traditional communication technology, IoT devices have several distinctive characteristics:
- Unsupervised operating environment: IoT devices are mobile and self-controlled.
- Heterogeneity: IoT devices are built on various processor architectures such as MIPS, ARM, PowerPC and MIPSEL.
- Constrained resources: IoT devices often have limited storage and small memory.
- Dynamic status: the status of an IoT device depends on its operating environment.
- Connectivity: IoT devices can easily connect to each other and interact with the information and communication infrastructure at a global scale.

1.2. Definition of IoT botnet

The term botnet derives from "robot", referring to its automated operation. A bot is an application that can automatically interact with other services on the network. Botnet malware is usually designed to infect specific devices such as personal computers, mobile devices or IoT devices, and then turn the infected devices into members of a larger network controlled by the attacker, known as the botmaster. A bot only performs its malicious activities after receiving commands from the C&C (command and control) server. This is the main difference between botnets and other types of malware. Therefore, this thesis defines an IoT botnet as "a botnet that can automatically infect IoT devices and is controlled by attackers".

Figure 1.1. Relationship between some IoT botnet malware

1.3. The evolution of IoT botnet

Based on the analysis and evaluation of recent research on IoT malware, as well as experience in detecting real malware samples, this thesis summarizes in a graph the evolution of the IoT malware used for massive DDoS attacks. However, such a list can never be complete, since attackers constantly modify and update their malware to create new variants every day.

1.4. Comparison between traditional botnet and IoT botnet

The comparison between traditional botnets and IoT botnets is given in Table 1.1.
Table 1.1. Comparison of botnet malware on traditional computers and on IoT devices

| Criteria | Traditional botnet on PC | IoT botnet |
|---|---|---|
| Attack types | Various attack types such as data encryption, data theft and DoS attacks | Leverages a huge number of IoT devices at a global scale to perform massive DDoS attacks |
| Architecture | Mainly focused on x86_64 | Multi-architecture, following the variety of IoT devices: ARM, MIPS, PowerPC, ... |
| Variety | High variety and complex structure | Low variety, mostly modifications of traditional botnets |
| Obfuscation techniques | Leverages computing power to perform complex obfuscation | Simple obfuscation, due to limited computational resources |
| Detectable footprints | Footprints are easy to detect by analyzing the behavior of the computer | Footprints are harder to detect, due to the operating characteristics of IoT devices |
| Executable capabilities | Easier to analyze in a sandbox | Harder to get a sample to run in a sandbox, due to multi-architecture constraints and activation conditions |
| Infection capabilities | Able to persist on the storage of the computer | Often removes persistence and operates only in volatile memory |
| Competition | Not very competitive, due to the large amount of computational resources on a PC | Very competitive: because resources are limited, an IoT botnet often deactivates or removes other malware after successfully infecting a device |

Conclusion of Chapter 1: This chapter introduced IoT botnets, including the definitions of IoT devices and IoT botnets as well as the evolution and life cycle of IoT botnets. It also compared traditional botnets with IoT botnets and summarized the key differences between them. These insights provide solid arguments for choosing a suitable IoT botnet detection method.

CHAPTER 2. IOT BOTNET MALWARE DETECTION METHOD

2.1. Comparison of static and dynamic analysis

Both static and dynamic analysis have certain advantages and limitations.
Table 2.1 summarizes the advantages and disadvantages of each method.

Table 2.1. Comparison of the two methods in IoT botnet malware detection

| | Dynamic analysis | Static analysis |
|---|---|---|
| Advantages | Observes the execution of a program to determine its behavior more precisely; more effective against obfuscated malware | Analyzes programs in detail and gives an overview of all their possible execution paths; no need to execute the malware, so it is not affected by multi-architecture issues when building an execution environment |
| Disadvantages | Only single-threaded execution can be monitored; may disclose the detection and analysis process to the malware; may pose a threat to the network and the system; difficult to fully emulate IoT devices (multi-architecture) | Depends heavily on decompilation techniques; has difficulty handling obfuscated malware |

Because the input is a multi-architecture executable file, a method that can handle this problem effectively and efficiently is required. The thesis therefore selects static analysis as the basis of its approach, exploiting the strengths of static analysis while limiting its weaknesses. The next part of the thesis analyzes and evaluates current studies that use static analysis to detect IoT botnet malware.

2.2. Evaluation of IoT botnet detection methods based on static analysis

Studies that use static analysis for malware detection often rely on common features such as file headers, system calls, API (Application Programming Interface) calls, PSI (Printable Strings Information), FLF (Function Length Frequency), linked libraries and opcodes (extracted from assembly code). Decompilation is a common way to extract these features from an executable file. How the features are extracted and processed greatly affects the accuracy and complexity of IoT malware detection methods, which can be divided into two groups: graph-based methods and non-graph-based methods, as illustrated in Figure 2.1.

Figure 2.1. Classification of static features in IoT botnet detection

Detection methods that use non-graph-based features build models from attributes of the binary file structure to classify a binary as malicious or benign. These methods extract features such as opcodes, strings or file structure that distinguish malicious patterns.
These features can be divided into two groups: high-level features and low-level features. Low-level features can be gathered directly from the file structure, whereas high-level features require disassembler tools such as IDA Pro or Radare2. Studies that represent executable files with non-graph-based features depend heavily on the values of the features (e.g. the function call inet_toa) and cannot describe the complex semantic relationships between features (for example, data dependencies in the life cycle of IoT malware capable of distributed denial of service attacks, referred to as IoT botnets). Besides, studies using non-graph-based features usually cannot handle malware obfuscation techniques such as encryption or junk data insertion.

The comparison below, organized by the data representation of the static features, shows that state-of-the-art studies using static features for IoT botnet malware detection have limitations.
- Studies that use an opcode representation, such as Hamed HaddadPajouh [14], Ensieh Modiri Dovom [57], Darabian [52] and Amin Azmoodeh et al. [36], rely on mechanisms such as identifying malware through opcode sequences, applying fuzzy pattern trees to detect malicious patterns, and detecting malware based on opcode frequency. Their limitations are that the sample sets are based only on the ARM architecture and that the datasets are not large enough.
- The research of Mohannad Alhanahnah [4] represents data as strings, from which signatures are generated to classify malware. However, the study is limited by its computational complexity and uses only four types of malware.
- The research of F. Shahzad et al. [96] represents data through the ELF header, extracting features from the sections of the binary file to detect malware. However, the study is limited because the structure of a binary file is easily edited.
- The research of Jiawei Su et al. [25] represents binaries as grayscale images for malware detection. However, the study is limited by a lack of precision when the samples use obfuscation or encryption techniques.
- The research of Hisham Alasmary et al. [32] represents data as a CFG and computes 23 graph-theoretic properties of the CFG to distinguish between malicious and benign samples. However, the study suffers from high computational complexity and imprecise properties.

This evaluation shows that every study has advantages and disadvantages. However, each method was tested on a different dataset and in a different environment. On that basis, the thesis conducted an objective assessment of current studies in the same testing environment and on the same dataset. The next part presents this dataset in detail; it is used not only for the evaluation in this chapter but also for the experiments in the following chapters.

2.2.1. Constructing the experimental dataset

To serve the experimental studies of the thesis reliably, it is important to construct a dataset consisting of malware and benign executable files from IoT devices.

Table 2.2. Dataset description

| Family name | Variants | Samples | ARM | MIPS |
|---|---|---|---|---|
| Mirai | 7 | 1,765 | 331 | 301 |
| Bashlite | 5 | 3,720 | 762 | 646 |
| Other botnet | 9 | 680 | 152 | 103 |
| Benign | - | 3,845 | 561 | 533 |
| Total | | 10,010 | 1,806 | 1,583 |

The dataset contains 10,010 samples, including 6,165 IoT botnet malware samples and 3,845 benign IoT samples. It covers many architectures, such as ARM, MIPS, PowerPC, SPARC and SuperH.
The thesis uses the voting strategy [17], [81] to decide whether a file is malware or benign, as shown in equation (2.1), in which $E_i(S)$ is the verdict of the $i$-th malware detector on VirusTotal (e.g. Kaspersky, Norton, BKAV) for sample $S$.

\[
Class_{lb} =
\begin{cases}
\text{Malicious}, & \text{if } E_1(S) \lor E_2(S) \lor \dots \lor E_n(S) \\
\text{Benign}, & \text{if } \lnot E_1(S) \land \lnot E_2(S) \land \dots \land \lnot E_n(S)
\end{cases}
\tag{2.1}
\]

For the experiments in this chapter, the thesis divides the dataset into two subsets: 70% of the dataset is the training set and the remaining 30% is used for evaluation. The experiments are implemented in Python with the Scikit-Learn library on Ubuntu 16.04, using an Intel Core i5-8500 CPU at 3.0 GHz and 32 GB of RAM.

2.2.2. Experimental results and discussions

The experimental results and evaluation of static feature approaches in IoT malware detection are shown in Table 2.3. The thesis re-implements the existing studies and reuses exactly the classifiers that each study used, so the classification algorithms differ between approaches. The purpose is to assess the reliability and accuracy of the studies on the same dataset, described in section 2.2.1.

Table 2.3. Experimental results of static feature approaches in IoT malware detection

| Static feature approach | Classifier | Accuracy (%) | FPR, False Positive Rate (%) | FNR, False Negative Rate (%) | Feature extraction and pre-processing time | Classification time |
|---|---|---|---|---|---|---|
| ELF-header [96] | RIPPER | 99.8 | 0.2 | 0.2 | 1h50m | 0.75s |
| ELF-header [96] | PART | 99.8 | 0.2 | 0.2 | 1h50m | 1.27s |
| ELF-header [96] | DT (J48) | 99.6 | 0.5 | 0.3 | 1h50m | 1s |
| String-based [70] | SVM | 98 | 0.9 | 2.2 | 4m47s | 12.4s |
| String-based [70] | kNN | 99.8 | 0.4 | 0.2 | 4m47s | 1s |
| String-based [70] | DT (J48) | 99.4 | 0.4 | 0.6 | 4m47s | 8.75s |
| String-based [70] | RF | 99.7 | 0.3 | 0.4 | 4m47s | 9.71s |
| Image-based [25] | Neural Network | 89.1 | 12.7 | 1.4 | 14m19s | 2m19s |
| CFG-based [32] | SVM | 89 | 33.8 | 4.4 | 5 days | 1.45s |
| CFG-based [32] | LR | 85 | 15.1 | 19.0 | 5 days | 0.5s |
| CFG-based [32] | RF | 95 | 7.5 | 5.9 | 5 days | 1.75s |

Table 2.3 shows that studies representing executables with non-graph-based features depend heavily on the values of the features (for example, the function call inet_toa) and cannot describe complex semantic relationships between features (for example, data dependencies in the life cycle of IoT malware capable of distributed denial of service attacks, referred to as IoT botnets). Besides, studies using non-graph-based features are often weak against obfuscation techniques such as encoding and data insertion. Meanwhile, graph-based approaches can represent the structured and complex information of botnet behavior.
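The labelling rule of equation (2.1) and the 70:30 split of section 2.2.1 can be illustrated with a minimal Python sketch. The sample names and detector verdicts are hypothetical, and the split uses only the standard library rather than whatever utility the thesis actually used; it is an illustration under those assumptions, not the thesis's code.

```python
import random


def label_sample(verdicts):
    """Equation (2.1): malicious if any detector flags the sample,
    benign only if every detector reports it as clean."""
    return "malicious" if any(verdicts) else "benign"


def split_70_30(samples, seed=0):
    """Shuffle the samples and split them 70% training / 30% evaluation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical verdicts from three detectors (True = flagged as malware).
verdicts_by_file = {
    "sample_a": [True, False, False],
    "sample_b": [False, False, False],
    "sample_c": [True, True, True],
}
labels = {name: label_sample(v) for name, v in verdicts_by_file.items()}

train, test = split_70_30(range(10))
```

Under this rule a single positive verdict is enough to label a file as malicious, so the benign class contains only files that no detector flags.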
Chapter 2 conclusion: The results of this chapter motivate the methods proposed in the thesis and show the feasibility of static analysis for the IoT botnet malware detection problem. Moreover, graph-based features are efficient and promising for detecting IoT botnet malware.

Contributions of Chapter 2: evaluating and comparing botnet malware on traditional computers and on IoT devices, as a basis for proposing a suitable static analysis method for detecting IoT botnet malware; building a reliable dataset for experiments in IoT botnet malware detection; and re-implementing and evaluating current studies based on static analysis on the same dataset and in the same experimental environment. These results have been published and presented in conference proceedings and prestigious journals ([B3], [B4], [B5] in the author's list of works).

CHAPTER 3. PSI GRAPH FEATURE FOR DETECTION OF IOT BOTNET

3.1. Statement of the problem

The research problem in this chapter is defined as follows:
- Let $L = \{l_1, l_2, \dots, l_n\}$ be the set of $n$ executable files, in which $l_i \in \{0, 1\}$ is either a malicious executable file (value 1) or a benign executable file (value 0), with $i = 1, \dots, n$.
- Let $F = \{f_{Alhanahnah}, f_{Su}, f_{HaddadPajouh}, f_{Azmoodeh}, f_{Alasmary}\}$ be a set of features for botnet detection on IoT devices that have given good results in recent years.

The problem is to find a feature $f_{Trung} \notin F$ such that $f_{Trung}(L)$ is simpler than every $f_j \in F$ in terms of graph structure, where simplicity is quantified by the number of edges and the number of vertices of the graph. Although structurally simpler, $f_{Trung}$ should give better results than $f_j \in F$ in terms of accuracy and execution time.

3.2. Explanation of the problem

The thesis chooses an approach based on static analysis for detecting botnet malware on IoT devices. Several studies already follow this approach, such as Alhanahnah et al. [4], Su et al. [25], HaddadPajouh et al. [14], Azmoodeh et al. [36] and Hisham Alasmary et al. [33]. Specifically, Mohannad Alhanahnah et al. combine various static features such as strings, control flow graph (CFG) and file structure statistics to generate signatures for the classification of multi-architecture IoT malware. Su et al. proposed a lightweight method that distinguishes malicious IoT samples from benign ones by converting them to grayscale images and feeding these images into a convolutional neural network to detect IoT malware. Hamed HaddadPajouh et al. and Azmoodeh et al. proposed methods for detecting IoT malware using opcode sequences. Hisham Alasmary et al. performed an in-depth study of the graphs of Android malware and IoT botnets. Given these characteristics of typical static analysis studies for detecting botnet malware on IoT devices, the new feature stated in the research problem of this chapter aims to take advantage of the strengths and overcome the limitations of existing features, thereby bringing high efficiency to IoT botnet malware detection with machine learning and deep learning algorithms.

3.3. Proposed method

The research problem in this chapter rests on the following assumptions. The basic difference between botnet malware and other types of malware is that a botnet always needs a connection to the C&C server to send and receive attack commands from the attacker. The infection and attack behavior of botnet malware on IoT devices has been studied extensively and has been found to follow a common process. Each step in the life cycle of IoT botnet malware usually involves information represented as strings, such as attacker commands or the IP addresses and domain names of C&C servers. Before explaining the proposed method, the following definitions are given.

Definition 3.1: A function-call graph is a directed graph $G = (V, E)$, where $V = V(G)$ is the set of vertices representing functions and $E = E(G)$, with $E(G) \subseteq V(G) \times V(G)$, is the set of edges corresponding to function calls. For each vertex $v \in V$, the two functions $V_n(v)$ and $V_f(v)$ give the function name and the function type of the function represented by $v$. The function type $t \in \{0, 1\}$ is either a local function (value 0) or an external function (value 1).

Definition 3.2: A Printable String Information (PSI) is a printable string appearing in the executable, either explicitly (e.g. "10.1.1.2") or encrypted (e.g. "eGAIM").

Definition 3.3: The PSI graph is a directed graph $G_{PSI} = (V, E)$, where:
- $V$ is the set of vertices consisting of the functions that are in the function-call graph and contain PSIs;
- $E$ is the set of directed edges $\{(V_i, V_j), (V_k, V_h), \dots\}$ reflecting the caller-callee relationship between two functions.

To support the above hypothesis and address the research problem, the proposed method has the structure shown in Figure 3.1.

Figure 3.1. The workflow of the proposed method to detect IoT botnet malware

The thesis provides a general model of the proposed method with two main processing phases, a training phase and a utilizing phase, illustrated in Figure 3.1. The two phases follow a similar process and differ only in the classification step.
With executable files from IoT devices as input, including malicious and benign files, the process consists of four steps: FCG generation, PSI graph generation, preprocessing and feature selection, and finally classification.

3.4. Function call graph in IoT botnet malware detection

A Function Call Graph (FCG) is a kind of control flow graph that represents the calling relationships between the functions or subroutines of a program. Its formal definition is given in Definition 3.1.
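Definition 3.1 can be made concrete with a small data structure. The sketch below is only one possible representation: the class name, the method names and the example functions are hypothetical, and the mappings `names` and `types` stand in for the functions that Definition 3.1 uses to give each vertex its function name and function type.

```python
from dataclasses import dataclass, field

LOCAL, EXTERNAL = 0, 1  # the two function types of Definition 3.1


@dataclass
class FunctionCallGraph:
    names: dict = field(default_factory=dict)  # vertex id -> function name
    types: dict = field(default_factory=dict)  # vertex id -> function type
    edges: set = field(default_factory=set)    # (caller id, callee id) pairs

    def add_function(self, vid, name, ftype=LOCAL):
        self.names[vid] = name
        self.types[vid] = ftype

    def add_call(self, caller, callee):
        # E is a subset of V x V, so both endpoints must already be vertices.
        if caller not in self.names or callee not in self.names:
            raise KeyError("both endpoints must be vertices of the graph")
        self.edges.add((caller, callee))

    def callees(self, vid):
        return sorted(c for (p, c) in self.edges if p == vid)


# Hypothetical example: main calls a local helper and an external function.
fcg = FunctionCallGraph()
fcg.add_function(0, "main")
fcg.add_function(1, "connect_cnc")
fcg.add_function(2, "inet_addr", EXTERNAL)
fcg.add_call(0, 1)
fcg.add_call(1, 2)
```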
Before constructing the function call graph, the defense techniques of the executable files must be checked and pre-processed to ensure that the function call graph is correct. The thesis uses the DiE (Detect It Easy) tool [115] to check whether a file is packed and, if so, which packer was used. Analysis of the thesis's dataset of 10,010 samples found that only about 2% of the samples used obfuscation techniques, the vast majority of which were UPX packing. Once a packer has been identified, many tools support unpacking, such as the UPX tool [121]. Executables that cannot be unpacked with the UPX tool are removed from the dataset. After unpacking, the thesis uses IDA Pro for decompilation because it supports multiple platforms. Decompiling the binary files with IDA Pro yields the assembly code of each file. The algorithm for building the function call graph (Algorithm 3.1) is adopted from the research of Ming Xu et al. [108].
Algorithm 3.1: Constructing the Function Call Graph (FCG)
Input: functions of the executable file F
Output: the function call graph G_F of the executable file F

// Initialization
G_F.V = ∅ and G_F.E = ∅
EntryFuncSet = ∅, FuncSet = ∅, FuncQ = ∅, VerSet = ∅
// Extracting functions from assembly code
FuncSet = SplitFuncs(F)
EntryFuncSet = IdentifyEntryPointFuncs(M)
FuncQ = InitQ(EntryFuncSet)
// Building a caller-callee relationship
while (FuncQ is not empty)
    baseVertex = Dequeue(FuncQ)
    Insert baseVertex in G_F
    baseVertex.enQFlag = true
    // Extracting the callees of baseVertex
    VerSet = getCallee(baseVertex)
    for each vertex in VerSet
        if ((vertex ∩ FuncSet) ≡ ∅)    // the vertex is not in FuncSet
            continue
        endif
        headVertex = vertex
        // Build the connecting edge e between baseVertex and headVertex
        if (e ∈ G_F.E)
            baseVertex.outDeg++
            headVertex.inDeg++
        else
            Insert headVertex in G_F
            Insert edge e in G_F
        endif
        if (headVertex.enQFlag == false)
            Enqueue headVertex in FuncQ
            headVertex.enQFlag = true
        endif
    next vertex
end while
return G_F
end
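The breadth-first traversal at the core of Algorithm 3.1 can be sketched in Python as follows. This is a simplified illustration rather than the thesis's implementation: the input `calls` is an assumed dictionary from each known function to the names it calls (standing in for the output of a disassembler), the function names are hypothetical, and the in/out degree counters that Algorithm 3.1 keeps for repeated edges are omitted.

```python
from collections import deque


def build_fcg(calls, entry_points):
    """Build (vertices, edges) by walking caller-callee links breadth-first.

    calls: dict mapping each known function name to a list of callee names.
    entry_points: names of the functions where the traversal starts.
    """
    vertices, edges = set(), set()
    queue = deque(entry_points)
    enqueued = set(entry_points)
    while queue:
        base = queue.popleft()
        vertices.add(base)
        for callee in calls.get(base, []):
            if callee not in calls:  # callee is not a known function: skip it
                continue
            vertices.add(callee)
            edges.add((base, callee))
            if callee not in enqueued:  # enqueue every function only once
                queue.append(callee)
                enqueued.add(callee)
    return vertices, edges


# Hypothetical call lists; "printf" is not a known function and "unused"
# is never reached from the entry point.
calls = {
    "main": ["init", "attack", "printf"],
    "init": ["resolve_cnc"],
    "attack": ["resolve_cnc", "attack"],
    "resolve_cnc": [],
    "unused": ["main"],
}
vertices, edges = build_fcg(calls, ["main"])
```

Only functions reachable from an entry point end up in the graph, which matches the queue-driven structure of the pseudocode.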
Function call graphs are still highly complex due to their large number of vertices and edges, and are often expensive to compute and to store [97]. If the complexity of a graph is measured by its number of edges and vertices, the complexity is 𝛰(|𝑉| ∗ |𝐸|), where |𝑉| is the number of vertices and |𝐸| is the number of edges. Therefore, starting from the function call graph, the thesis aims to build a new graph feature that is highly efficient for detecting IoT botnet malware with machine learning and deep learning techniques: it should have low complexity, reducing the number of vertices and edges of the graph, while still ensuring a high detection rate.

3.5. PSI Graph construction

Before building the PSI graph (Definition 3.3), the thesis extracts all PSI (Definition 3.2) inside the executable file with a plugin for the IDA Pro tool. To balance classification accuracy against computational complexity, the thesis selects PSI with a minimum length of 3 characters. These PSI can be in either plain or encrypted form and often contain a lot of semantic information relevant to the attacker's intent. After constructing the function call graph and identifying the vertices that contain PSI, the thesis traverses the function call graph to construct the PSI graph, as described in Algorithm 3.2.

Algorithm 3.2: PSI-Graph Generation (FCG)
𝑉 = [ ], 𝐸 = [ ]
For each vertex 𝑣𝑖 in FCG do:
    If exists psi in 𝑣𝑖 do:
        𝑉 = 𝑉 ∪ 𝑣𝑖
    End if
    For each edge 𝑒𝑗(𝑣𝑖, 𝑣𝑘) do:
        If exists psi in 𝑣𝑘 and 𝑣𝑘 ∉ 𝑉 and 𝑒𝑗(𝑣𝑖, 𝑣𝑘) ∉ 𝐸 do:
            𝑉 = 𝑉 ∪ 𝑣𝑘
            𝐸 = 𝐸 ∪ 𝑒𝑗(𝑣𝑖, 𝑣𝑘)
        End if
    End for
End for
Return 𝑉, 𝐸

The PSI graph is generated by trimming the FCG to reduce the number of vertices and edges, so the PSI graph generation algorithm also has complexity 𝑂(|𝑉| ∗ |𝐸|), and the complexity of the resulting graph decreases accordingly.
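The trimming step can be illustrated with a short Python sketch. It is one plausible reading of Algorithm 3.2, not the thesis's code: it assumes the strings referenced by each function are already available in a dictionary (the `strings_by_func` format and the sample strings are hypothetical), and it keeps an FCG edge only when both endpoints contain PSI, which may differ in detail from the conditions in the listing.

```python
MIN_PSI_LEN = 3  # minimum PSI length chosen in the thesis

def has_psi(strings, min_len=MIN_PSI_LEN):
    """Return True if any string is printable and at least min_len characters long."""
    return any(len(s) >= min_len and s.isprintable() for s in strings)

def build_psi_graph(fcg_edges, strings_by_func):
    """Trim a function call graph down to the vertices that contain PSI.

    fcg_edges: iterable of (caller, callee) pairs.
    strings_by_func: dict mapping a function name to the strings it references
                     (hypothetical input format).
    """
    psi_vertices = {f for f, strs in strings_by_func.items() if has_psi(strs)}
    # Assumption: an edge survives only if both endpoints contain PSI.
    psi_edges = {(u, v) for (u, v) in fcg_edges
                 if u in psi_vertices and v in psi_vertices}
    return psi_vertices, psi_edges

# Toy input: "connect_cc" only references a 2-character string, so it is dropped.
fcg_edges = {("main", "init"), ("main", "scan"), ("scan", "connect_cc")}
strings_by_func = {
    "main": ["/bin/busybox"],
    "init": [],
    "scan": ["telnet", "ab"],
    "connect_cc": ["ok"],
}
psi_vertices, psi_edges = build_psi_graph(fcg_edges, strings_by_func)
```

With this reading the PSI graph is simply the subgraph of the FCG induced by the PSI-bearing functions, so it can never have more vertices or edges than the FCG it was derived from.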
Table 3.1 compares the size of the PSI graph with that of the function call graph. The PSI graph is much smaller than the function call graph in both the number of vertices and the number of edges, for malicious and benign files alike. Therefore, using the PSI graph as a feature to detect malicious code can reduce complexity (higher processing speed, lower computation time) compared to using the function call graph.

Table 3.1. Comparison between the PSI graph and the function call graph (FCG)

Class      Avg. vertices (PSI graph)  Avg. edges (PSI graph)  Avg. vertices (FCG)  Avg. edges (FCG)
Malicious  147.1                      1110.5                  254.5                3075.5
Benign     167.8                      1693.9                  530.9                2962.2
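To make the size difference in Table 3.1 concrete, the relative reduction can be computed directly from the table's averages. The snippet below is only a worked calculation on those published values; the dictionary layout and the function name are illustrative.

```python
# Average sizes copied from Table 3.1 as (vertices, edges).
TABLE_3_1 = {
    "malicious": {"psi": (147.1, 1110.5), "fcg": (254.5, 3075.5)},
    "benign": {"psi": (167.8, 1693.9), "fcg": (530.9, 2962.2)},
}

def reduction_percent(psi_value, fcg_value):
    """Percentage by which the PSI graph is smaller than the FCG."""
    return round(100.0 * (1.0 - psi_value / fcg_value), 1)

reductions = {}
for cls, sizes in TABLE_3_1.items():
    (psi_v, psi_e), (fcg_v, fcg_e) = sizes["psi"], sizes["fcg"]
    reductions[cls] = {
        "vertices": reduction_percent(psi_v, fcg_v),
        "edges": reduction_percent(psi_e, fcg_e),
    }

for cls, r in reductions.items():
    print(f"{cls}: {r['vertices']}% fewer vertices, {r['edges']}% fewer edges")
```

On these averages the PSI graph has roughly 42% fewer vertices and 64% fewer edges for malicious files, and roughly 68% fewer vertices and 43% fewer edges for benign files.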
As can be seen in Figure 3.2, the number of vertices in the PSI graph is concentrated mainly in the range [1, 300] for both malicious and benign files. Although the distributions differ slightly, the difference is not clear enough to establish a threshold value that distinguishes benign samples from IoT malware samples.

Figure 3.2. Number of edges and vertices across sample classes

To visualize the result of the PSI graph generation algorithm, Figure 3.3 shows an example for a Linux.Bashlite sample; the PSI graph is clearly much simpler than the function call graph. On average, a PSI graph contains only about 16 vertices and 60 edges, compared to the 156 vertices and 360 edges of the function call graph.

Figure 3.3. Function call graph (left) and PSI graph (right) of Linux.Bashlite malware sample

In summary, the PSI graph feature obtained by the thesis has the following characteristics:
- It is built with a static method;
- It can reflect the "lifecycle behavior", in other words simulate the infection process, of IoT botnet malware;
- It considers only the structure of printable string information (PSI), not the values of the strings;
- It is built on the function call graph.

3.6. Experimental evaluation

3.6.1. Experimental environment

Using the experimental dataset presented in section 2.2.1 of this thesis summary, the thesis divides the dataset into two subsets: a training set and a test set. The training set contains 2690 samples for each of the malicious and benign classes. The test set contains 4630 samples. The experiments are implemented with Python and the PyTorch framework on Ubuntu 16.04, running on an Intel Core i5-8500 3.0 GHz CPU, an NVIDIA GeForce GTX 1080 Ti graphics card and 32 GB of RAM.

3.6.2. Evaluation model

To evaluate the effectiveness of PSI graph features in the IoT botnet malware detection problem, the thesis feeds PSI graph features into the evaluation model shown in Figure 3.4. The approach analyzes the entire structure of the PSI graph and represents it as a fixed-length numerical vector, so the thesis uses graph2vec [39] in the data preprocessing step.

Figure 3.4. Evaluation model of detecting IoT botnet malware using PSI Graph

Graph2vec is an unsupervised learning technique for converting a graph into a numerical vector. It is based on the idea of the doc2vec approach [82], which uses the skip-gram network. Graph2vec learns graph representations by treating an entire graph as a text and its subgraphs as the words that make up that text.

Algorithm 3.3: Graph2vec (𝒢, 𝐷, 𝛿, 𝔢, 𝛼)
Input:
    𝒢 = {𝐺1, 𝐺2, …, 𝐺𝑛}: set of graphs such that each graph 𝐺𝑖 = (𝑉𝑖, 𝐸𝑖, 𝜆𝑖), for which embeddings have to be learnt
    𝐷: maximum degree of rooted subgraphs to be considered for learning embeddings. This will produce a vocabulary of subgraphs, 𝑆𝐺𝑣𝑜𝑐𝑎𝑏 = {𝑠𝑔1, 𝑠𝑔2, …}, from all the graphs in 𝒢
    𝛿: number of dimensions (embedding size)
    𝔢: number of epochs
    𝛼: learning rate
Output: matrix of vector representations of graphs $\Phi \in \mathbb{R}^{|\mathcal{G}| \times \delta}$
Initialization: sample $\Phi$ from $\mathbb{R}^{|\mathcal{G}| \times \delta}$
for 𝑒 = 1 to 𝔢 do
    𝜔 = Shuffle(𝒢)
    for each 𝐺𝑖 ∈ 𝜔 do
        for each 𝑣 ∈ 𝑉𝑖 do
            for 𝑑 = 0 to 𝐷 do
                $sg_v^{(d)} := \mathrm{GetWLSubgraph}(v, G_i, d)$
                $\mathcal{J}(\Phi) = -\log \Pr\left(sg_v^{(d)} \mid \Phi(\mathcal{G})\right)$
                $\Phi = \Phi - \alpha \, \frac{\partial \mathcal{J}}{\partial \Phi}$
Return Φ

Graph2vec works as follows. The entire graph is treated as a document, the subgraphs of that graph are treated as sentences, and each vertex is processed as a word. The document is built by traversing the graph. Once the document has been built, the skip-gram technique is used to learn a representation of the graph. Because the model is trained to predict subgraphs, graphs with similar subgraphs and similar structures receive similar embeddings. The result of this step is a set of fixed-length numerical vectors (of dimension 𝛿) representing the set of graphs. In the proposed study, the thesis represents PSI graphs as numerical vectors of length 1024, which are used for later classification.

The data obtained from the PSI graph preprocessing step is used to decide whether a file is malicious with a deep neural network classifier. To build the convolutional neural network, the thesis adopts the network model proposed by Kim [75]. The first layer of the network is the input layer, and the next layer performs convolution operations using multiple filter sizes. The output of this layer is passed to a nonlinear function, the ReLU activation, defined as 𝑓(𝑥) = max(0, 𝑥), because ReLU is simpler to compute than the sigmoid activation function, which requires evaluating an exponential [100]. Next, a max-pooling layer reduces the dimension of the data coming from the convolutional layer, which lowers the complexity and computational resources required and improves scalability. Finally, the fully connected layer classifies the outputs produced by the convolution and pooling layers.

3.6.3. Experimental results and discussion

To evaluate the effectiveness of PSI graph features in detecting IoT botnet malware, the thesis ran experiments comparing two features, the PSI graph and the FCG, using accuracy, FNR, FPR and processing time as metrics.

Table 3.2. The results of detecting IoT botnet malware by PSI graph and function call graph

Features    Accuracy (%)  FNR (%)  FPR (%)  Time (min)
PSI-graphs  98.7          1.83     0.78     88
FCGs        95.3          5.81     4.13     545

The results in Table 3.2 show that the proposed method using PSI graph features performs better than the function call graph. The proposed method achieved an accuracy 3.4 percentage points higher than the function call graph (98.7% versus 95.3%), and its execution time was 457 minutes shorter. In addition, the false negative rate of the proposed method is 1.83%, while that of the FCG method is 5.81%. In malware detection, a lower false negative rate means the classifier less often mislabels malicious code as benign. The proposed method still wrongly labels a very small share of benign files as malicious code. This occurs for some benign files whose PSI graph structure is similar to that of some Linux.Bashlite malware samples. Manual analysis of those samples found that, although the executables, their FCGs and the resulting assembly code differed, they still had the same PSI graph structure. However, this false positive rate is only 0.78%, a very small percentage.

Table 3.3. Comparison between IoT botnet detection methods

Methods                   Algorithms                      Accuracy (%)
Su et al. [25]            Deep neural network (CNN)       95.13
HaddadPajouh et al. [14]  Recurrent neural network (RNN)  97.88
PSI-Graph                 Deep neural network (CNN)       98.7

Dataset (all methods): the dataset described in section 2.2.1, which includes 6943 samples (of which 3098 are botnet samples from IoTPOT).

Table 3.3 shows that the methods of Su et al. [25] and HaddadPajouh et al. [14] both give promising results. However, the lack of published test datasets and source code for those models makes retesting and evaluating them quite difficult. This thesis therefore rebuilt those methods from their published materials and articles. The results show that the proposed method achieved better accuracy than those of Su and HaddadPajouh, by 3.57 and 0.82 percentage points respectively.

Table 3.4. Evaluation of over-fitting

Methods    Algorithms                 Accuracy (%)
PSI-Graph  Deep neural network (CNN)  97.8

Dataset: the dataset described in section 2.2.1, which includes 10,010 samples (of which 6165 are botnet samples from IoTPOT and VirusShare).

Finally, over-fitting problems often occur with deep learning algorithms. Over-fitting occurs when the model fits the training dataset too closely but does not perform well on additional data. To evaluate over-fitting in the proposed model, the thesis added 3067 malicious code samples collected from VirusShare to the test set and recalculated the accuracy. As shown in Table 3.4, when malicious code samples from VirusShare were added to the dataset, the detection accuracy decreased slightly (by 0.9 percentage points). Thus, the experimental results indicate that the proposed method achieves good results in detecting IoT malware while keeping over-fitting within an acceptable range.
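The metrics reported in Tables 3.2 to 3.4 can be written out explicitly from a confusion matrix. The sketch below assumes the usual convention that malicious files are the positive class; the counts in the example are hypothetical and are not taken from the thesis's experiments.

```python
def detection_metrics(tp, fn, fp, tn):
    """Return accuracy, FNR and FPR in percent.

    tp: malicious files labeled malicious   fn: malicious files labeled benign
    fp: benign files labeled malicious      tn: benign files labeled benign
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "fnr": 100.0 * fn / (tp + fn),   # share of malware that was missed
        "fpr": 100.0 * fp / (fp + tn),   # share of benign files flagged
    }

# Hypothetical counts for illustration only.
metrics = detection_metrics(tp=95, fn=5, fp=2, tn=98)
print(metrics)
```

A missed malware sample raises the FNR, while a benign file flagged as malicious raises the FPR, which is why the two rates are reported separately from overall accuracy.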
Conclusion of Chapter 3

Based on the analysis and evaluation of the characteristics of IoT botnet malware, and in order to address the limitations of previous studies that detect IoT botnet malware with graph-structured features, the thesis proposed a lightweight approach based on a high-level feature, called the PSI graph, to detect IoT botnet malware. The proposed method mines the life cycle of IoT botnet malware to generate the PSI graph feature and applies the advantages of deep learning to achieve an accuracy of up to 98.7%, with over-fitting within an acceptable range for the problem of detecting IoT botnet malware. However, the proposed method only exploits the overall structure of the PSI graph and still has a rather large time cost.

Contributions of Chapter 3

The thesis proposes a new feature with a graph structure, called the PSI graph, that is effective in detecting multi-architecture botnet malware on IoT devices. The research results have been published and presented in prestigious domestic and international conference proceedings and journals ([B1], [B6], [B7] in the author's list of works).
CHAPTER 4. PSI-ROOTED SUBGRAPH FEATURE IN DETECTING IOT BOTNET

4.1. Statement of the problem

The method of detecting IoT botnet malware based on PSI graph features has shown high feasibility and efficiency. However, it exploits the overall structure of the PSI graph and does not exploit the paths in it; in other words, it treats the PSI graph as a single whole. As botnet malware executables on IoT devices grow more complex, the structure of the PSI graph becomes more complex as well. Meanwhile, the malicious behaviors that often appear in the life cycle of IoT botnet malware may correspond to paths in the PSI graph. As illustrated in Figure 4.1, these may be the green or red paths, while the other paths are redundant data. On that basis, the research problem of this chapter is stated as follows: build a new feature based on the PSI graph that focuses on exploring paths in it, yielding a new graph feature called the PSI-rooted subgraph, which represents the malicious behavior of IoT botnet malware and improves the efficiency of detecting IoT botnet malware with simple machine learning algorithms.

Figure 4.1. Illustration of the problem idea using a PSI-rooted subgraph

4.2. Building the PSI-rooted subgraph feature

Definition 4.1 (PSI-rooted subgraph): Let 𝐺𝑠𝑔 = (𝑉, 𝐸, 𝜃, 𝑑) represent an acyclic directed PSI-rooted subgraph generated from 𝐺𝑃𝑆𝐼 and rooted at vertex 𝜃, where 𝑉 is the set of vertices 𝑉𝑖 of 𝐺𝑃𝑆𝐼 whose path length from 𝜃 satisfies 0 ≤ length(𝜃, 𝑉𝑖) ≤ 𝑑, and 𝐸 is the set of directed edges between vertices in 𝑉.

After building the PSI graph and identifying its vertices, the thesis traverses the PSI graph with each vertex as the root; the procedure is shown in Algorithm 4.1.
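Definition 4.1 can be read as a depth-limited traversal from the root vertex. The following Python sketch illustrates that reading and is not the thesis's Algorithm 4.1: the adjacency-list input and the vertex names are hypothetical, and keeping only edges that go from one breadth-first level to the next is an assumption made here to guarantee that the extracted subgraph is acyclic.

```python
from collections import deque

def psi_rooted_subgraph(adj, root, depth):
    """Extract the subgraph rooted at `root` with vertices at distance <= depth.

    adj: dict mapping a vertex to the list of its successors in the PSI graph
         (hypothetical input format).
    Returns (vertices, edges).
    """
    dist = {root: 0}                  # BFS distance from the root
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == depth:          # do not expand beyond the depth limit
            continue
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Assumption: keep only level k -> level k+1 edges so the result is acyclic.
    edges = {(u, v) for u in dist for v in adj.get(u, ())
             if v in dist and dist[v] == dist[u] + 1}
    return set(dist), edges

# Toy PSI graph with a back edge c -> a and a vertex e beyond depth 2.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": ["e"]}
sub_vertices, sub_edges = psi_rooted_subgraph(adj, "a", 2)
```

Calling the function once per vertex of a PSI graph would give one rooted subgraph per root, which matches the idea of traversing the graph with each vertex as the root.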
Algorithm 4.1: ExtractRootedSubgraph(𝒢, 𝐷)
Input:
    𝒢 = {𝐺1, 𝐺2, …, 𝐺𝑛}: set of PSI graphs 𝐺𝑖 = (𝑉𝑖, 𝐸𝑖) representing ELF files
    𝐷: maximum degree of PSI-rooted subgraph
Output: 𝒮𝒢 = {𝑆𝐺1, 𝑆𝐺2, …, 𝑆𝐺𝑛}: set of PSI-rooted subgraphs 𝑆𝐺𝑖 = (𝑉𝑖′, 𝐸𝑖′, 𝑣, 𝐷) extracted from 𝒢
Initialization: 𝒮𝒢 = ∅
for each 𝐺𝑖 ∈ 𝒢 do
    for each 𝑣 ∈ 𝑉𝑖 do