Introduction to the UNSW-NB15 Dataset

王茂南


September 3, 2019, 07:05:53


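Abstract: This article introduces the UNSW-NB15 dataset, another dataset used for intrusion detection. Drawing on the original paper, it describes the dataset's features and the distribution of its records.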


Introduction

This article covers the UNSW-NB15 dataset, which is used for intrusion detection. An earlier article introduced two other intrusion detection datasets: Introduction to the KDD99 and NSL-KDD Datasets

Overview of UNSW-NB15

Official website of the dataset: The UNSW-NB15 Dataset Description

Download link for the dataset: UNSW-NB15 Download

The dataset contains nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms.

The dataset has 49 features in total; each of them is described later in this article.

The CSV data contains 2,540,044 records in total, stored in four CSV files.

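The number of records for each attack is listed in UNSW-NB15_LIST_EVENTS.csv; a brief analysis follows later.

The dataset has already been split into a training set and a testing set: UNSW_NB15_training-set.csv and UNSW_NB15_testing-set.csv.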

The training set contains 175,341 records and the testing set contains 82,332 records, covering the different types, attack and normal. Figures 1 and 2 show the testbed configuration of the dataset and the feature creation method of UNSW-NB15, respectively.
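As a quick way to look at the split files, the sketch below counts records per attack category using only the Python standard library. It assumes the file has a header row with a column named `attack_cat`; that column name, the helper `count_categories`, and the made-up sample rows are assumptions for illustration, not something the dataset description guarantees.

```python
import csv
import io
from collections import Counter


def count_categories(fileobj, column="attack_cat"):
    """Count how many rows fall into each value of `column`.

    Assumes the file starts with a header row that contains `column`.
    """
    reader = csv.DictReader(fileobj)
    return Counter(row[column].strip() for row in reader)


# Tiny made-up sample standing in for the real training-set file.
sample = io.StringIO(
    "id,proto,attack_cat,label\n"
    "1,tcp,Normal,0\n"
    "2,udp,Generic,1\n"
    "3,tcp,Normal,0\n"
)
print(count_categories(sample))
```

With the real data, the same function could be called on `open("UNSW_NB15_training-set.csv", newline="")`, provided the header assumption holds.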

UNSW-NB15 Features

The dataset has 49 features, described group by group below. The descriptions are taken from:

Moustafa, Nour, and Jill Slay. "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)." In 2015 military communications and information systems conference (MilCIS), pp. 1-6. IEEE, 2015.

In the feature lists below, the abbreviations in the Type column stand for:

N: nominal,
I: integer,
F: float,
T: timestamp,
B: binary

Flow Features

#, Name, Type, Description
------------------------------
1, srcip, N, Source IP address
2, sport, I, Source port number
3, dstip, N, Destination IP address
4, dsport, I, Destination port number
5, proto, N, Transaction protocol

Base Features

6, state, N, The state and its dependent protocol, e.g. ACC, CLO, else (-)
7, dur, F, Record total duration
8, sbytes, I, Source to destination bytes
9, dbytes, I, Destination to source bytes
10, sttl, I, Source to destination time to live
11, dttl, I, Destination to source time to live
12, sloss, I, Source packets retransmitted or dropped
13, dloss, I, Destination packets retransmitted or dropped
14, service, N, Service used, e.g. http, ftp, ssh, dns, else (-)
15, sload, F, Source bits per second
16, dload, F, Destination bits per second
17, spkts, I, Source to destination packet count
18, dpkts, I, Destination to source packet count

Content Features

19, swin, I, Source TCP window advertisement
20, dwin, I, Destination TCP window advertisement
21, stcpb, I, Source TCP sequence number
22, dtcpb, I, Destination TCP sequence number
23, smeansz, I, Mean of the flow packet size transmitted by the src
24, dmeansz, I, Mean of the flow packet size transmitted by the dst
25, trans_depth, I, The depth into the connection of the http request/response transaction
26, res_bdy_len, I, The content size of the data transferred from the server's http service.

Time Features

27, sjit, F, Source jitter (mSec)
28, djit, F, Destination jitter (mSec)
29, stime, T, Record start time
30, ltime, T, Record last time
31, sintpkt, F, Source inter-packet arrival time (mSec)
32, dintpkt, F, Destination inter-packet arrival time (mSec)
33, tcprtt, F, The sum of 'synack' and 'ackdat' of the TCP.
34, synack, F, The time between the SYN and the SYN_ACK packets of the TCP.
35, ackdat, F, The time between the SYN_ACK and the ACK packets of the TCP.
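The relation stated for tcprtt (the sum of synack and ackdat) can be used as a simple sanity check on a record. The sketch below is only an illustration: the helper name `tcprtt_consistent` and the tolerances are assumptions, chosen because stored float values are rounded and rarely add up exactly.

```python
import math


def tcprtt_consistent(tcprtt, synack, ackdat, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if tcprtt matches synack + ackdat up to rounding error."""
    return math.isclose(tcprtt, synack + ackdat, rel_tol=rel_tol, abs_tol=abs_tol)


# Made-up values: 0.1 s for SYN -> SYN_ACK and 0.2 s for SYN_ACK -> ACK.
print(tcprtt_consistent(0.3, 0.1, 0.2))
```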

The features from 1-35 represent the integrated gathered information from data packets. The majority of features are generated from header packets as reflected above.

Additional Generated Features--General purpose features

In the general purpose features, each feature has its own purpose, according to the defence point of view.

36, is_sm_ips_ports, B, If the source (1) and destination (3) IP addresses are equal and the port numbers (2)(4) are equal, this variable takes value 1, else 0
37, ct_state_ttl, I, No. for each state (6) according to specific range of values for source/destination time to live (10) (11).
38, ct_flw_http_mthd, I, No. of flows that have methods such as Get and Post in http service.
39, is_ftp_login, B, If the ftp session is accessed by user and password then 1 else 0.
40, ct_ftp_cmd, I, No. of flows that have a command in an ftp session.
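Of the general purpose features, is_sm_ips_ports is the one that follows directly from the flow features, so it is easy to sketch. The function below is a minimal illustration of the definition given in the list, not the authors' extraction code; the argument names simply mirror features 1 to 4.

```python
def is_sm_ips_ports(srcip, sport, dstip, dsport):
    """Return 1 if source and destination share both IP address and port, else 0."""
    return int(srcip == dstip and sport == dsport)


# Made-up example flows.
print(is_sm_ips_ports("10.0.0.1", 80, "10.0.0.1", 80))
print(is_sm_ips_ports("10.0.0.1", 1390, "10.0.0.2", 53))
```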

Additional Generated Features--Connection features

Connection features are created solely to provide defence in connection-attempt scenarios.

Attackers might scan hosts in a capricious way, for example once per minute or once per hour. To identify these attackers, features 36-47 are computed over records sorted by the last time feature, so that they capture similar characteristics of the connection records within each run of 100 sequentially ordered connections.

41, ct_srv_src, I, No. of connections that contain the same service (14) and source address (1) in 100 connections according to the last time (30).
42, ct_srv_dst, I, No. of connections that contain the same service (14) and destination address (3) in 100 connections according to the last time (30).
43, ct_dst_ltm, I, No. of connections of the same destination address (3) in 100 connections according to the last time (30).
44, ct_src_ltm, I, No. of connections of the same source address (1) in 100 connections according to the last time (30).
45, ct_src_dport_ltm, I, No. of connections of the same source address (1) and the destination port (4) in 100 connections according to the last time (30).
46, ct_dst_sport_ltm, I, No. of connections of the same destination address (3) and the source port (2) in 100 connections according to the last time (30).
47, ct_dst_src_ltm, I, No. of connections of the same source (1) and the destination (3) address in 100 connections according to the last time (30).
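The connection features all follow one pattern: count records that share some key fields within 100 connections ordered by last time. The sketch below shows one possible reading of that description, a sliding window over the most recent `window` records including the current one. The exact windowing used by the dataset authors is not spelled out here, so the function `windowed_counts` and its behaviour are assumptions for illustration only.

```python
from collections import Counter, deque


def windowed_counts(records, key_fields, window=100):
    """For each record, count records in the current window with the same key.

    `records` is assumed to be already sorted by last time (ltime).
    The window holds the `window` most recent records, including the current one.
    """
    recent = deque()    # keys of the records currently inside the window
    counts = Counter()  # how often each key occurs inside the window
    result = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        recent.append(key)
        counts[key] += 1
        if len(recent) > window:
            old = recent.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        result.append(counts[key])
    return result


# Made-up records; a ct_srv_src-style count keys on service and source address.
flows = [
    {"service": "http", "srcip": "10.0.0.1"},
    {"service": "http", "srcip": "10.0.0.1"},
    {"service": "dns", "srcip": "10.0.0.1"},
    {"service": "http", "srcip": "10.0.0.1"},
]
print(windowed_counts(flows, ("service", "srcip")))
```

Changing `key_fields` gives the other counters in the same style, for example `("dstip",)` for a ct_dst_ltm-style count or `("srcip", "dsport")` for a ct_src_dport_ltm-style count.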

Labelled Features

48, attack_cat, N, The name of each attack category. This data set has nine attack categories (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms); counting Normal, there are 10 classes in total.
49, Label, B, 0 for normal and 1 for attack records
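The 49 features and their type codes can be written down as a small schema and used to parse rows. The sketch below uses only the standard library and rests on several assumptions that should be checked against the actual files: that a raw CSV has no header row and exactly 49 columns in the order listed, that timestamps (T) are stored as integers, and that unexpected or empty cells are acceptable to keep as text or `None`. The helper names `convert` and `parse_rows` are made up for this example.

```python
import csv
import io

# (name, type code) for the 49 features, in the order listed in this article.
FEATURES = [
    ("srcip", "N"), ("sport", "I"), ("dstip", "N"), ("dsport", "I"), ("proto", "N"),
    ("state", "N"), ("dur", "F"), ("sbytes", "I"), ("dbytes", "I"), ("sttl", "I"),
    ("dttl", "I"), ("sloss", "I"), ("dloss", "I"), ("service", "N"), ("sload", "F"),
    ("dload", "F"), ("spkts", "I"), ("dpkts", "I"),
    ("swin", "I"), ("dwin", "I"), ("stcpb", "I"), ("dtcpb", "I"), ("smeansz", "I"),
    ("dmeansz", "I"), ("trans_depth", "I"), ("res_bdy_len", "I"),
    ("sjit", "F"), ("djit", "F"), ("stime", "T"), ("ltime", "T"), ("sintpkt", "F"),
    ("dintpkt", "F"), ("tcprtt", "F"), ("synack", "F"), ("ackdat", "F"),
    ("is_sm_ips_ports", "B"), ("ct_state_ttl", "I"), ("ct_flw_http_mthd", "I"),
    ("is_ftp_login", "B"), ("ct_ftp_cmd", "I"),
    ("ct_srv_src", "I"), ("ct_srv_dst", "I"), ("ct_dst_ltm", "I"), ("ct_src_ltm", "I"),
    ("ct_src_dport_ltm", "I"), ("ct_dst_sport_ltm", "I"), ("ct_dst_src_ltm", "I"),
    ("attack_cat", "N"), ("Label", "B"),
]


def convert(value, type_code):
    """Convert one raw CSV cell according to its type code (N, I, F, T, B)."""
    value = value.strip()
    if value == "":
        return None  # missing cell
    try:
        if type_code in ("I", "T", "B"):
            return int(value)  # assumption: timestamps are stored as integers
        if type_code == "F":
            return float(value)
    except ValueError:
        return value  # keep unexpected values as text instead of failing
    return value  # nominal values stay as text


def parse_rows(fileobj):
    """Yield one dict per CSV row, keyed by feature name (no header row assumed)."""
    for row in csv.reader(fileobj):
        if len(row) != len(FEATURES):
            raise ValueError(f"expected {len(FEATURES)} columns, got {len(row)}")
        yield {name: convert(cell, code) for (name, code), cell in zip(FEATURES, row)}
```

Keeping the schema as plain data makes it easy to adjust if a file turns out to use a different column order or different spellings.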

UNSW-NB15 Data

Distribution of the dataset

The list below gives the distribution of all records of the UNSW-NB15 data set. The major categories of the records are normal and attack. The attack records are further classified into nine families according to the nature of the attacks.

(1) Normal: 2,218,761; Natural transaction data.
(2) Fuzzers: 24,246; Attempting to cause a program or network to be suspended by feeding it randomly generated data (fuzzing attacks).
(3) Analysis: 2,677; It contains different attacks of port scan, spam and html file penetrations.
(4) Backdoors: 2,329; A technique in which a system security mechanism is bypassed stealthily to access a computer or its data.
(5) DoS: 16,353; A malicious attempt to make a server or a network resource unavailable to users, usually by temporarily interrupting or suspending the services of a host connected to the Internet.
(6) Exploits: 44,525; The attacker knows of a security problem within an operating system or a piece of software and leverages that knowledge by exploiting the vulnerability.
(7) Generic: 215,481; A technique that works against all block ciphers (with a given block and key size), without consideration of the structure of the block cipher.
(8) Reconnaissance: 13,987; Contains all strikes that can simulate attacks that gather information.
(9) Shellcode: 1,511; A small piece of code used as the payload in the exploitation of a software vulnerability.
(10) Worms: 174; The attacker replicates itself in order to spread to other computers. Often, it uses a computer network to spread itself, relying on security failures on the target computer to access it.
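The counts listed above can be checked against the stated total of 2,540,044 records, and turned into shares to make the class imbalance visible. This is a small sketch using the numbers from the list; the variable names are made up for the example.

```python
# Record counts per category, copied from the list above.
CATEGORY_COUNTS = {
    "Normal": 2_218_761,
    "Fuzzers": 24_246,
    "Analysis": 2_677,
    "Backdoors": 2_329,
    "DoS": 16_353,
    "Exploits": 44_525,
    "Generic": 215_481,
    "Reconnaissance": 13_987,
    "Shellcode": 1_511,
    "Worms": 174,
}

total = sum(CATEGORY_COUNTS.values())
attack_total = total - CATEGORY_COUNTS["Normal"]

print(f"total records:  {total:,}")
print(f"attack records: {attack_total:,}")
for name, count in CATEGORY_COUNTS.items():
    # Share of each category among all records.
    print(f"{name:15s} {count:>9,d} {count / total:7.2%}")
```

The per-category counts add up exactly to the total stated earlier, and normal traffic makes up the large majority of the records, which matters when interpreting plain accuracy figures.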

UNSW-NB15 Files

Four CSV files of the data records are provided, and each CSV file contains attack and normal records. The names of the CSV files are UNSW-NB15_1.csv, UNSW-NB15_2.csv, UNSW-NB15_3.csv and UNSW-NB15_4.csv.

In each CSV file, all the records are ordered according to the last time attribute. Each of the first three CSV files contains 700,000 records, and the fourth file contains 440,044 records.
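The per-file record counts can be checked against the overall total, and the same numbers can serve as a download sanity check. In the sketch below, the file names are assumed to follow the pattern UNSW-NB15_N.csv, and `count_rows` is a hypothetical helper that assumes every CSV row is one record with no header row.

```python
import csv
import io

# Expected number of records per file, as stated in the text above.
EXPECTED_ROWS = {
    "UNSW-NB15_1.csv": 700_000,
    "UNSW-NB15_2.csv": 700_000,
    "UNSW-NB15_3.csv": 700_000,
    "UNSW-NB15_4.csv": 440_044,
}


def count_rows(fileobj):
    """Count CSV records in an open file (assumes there is no header row)."""
    return sum(1 for _ in csv.reader(fileobj))


print(sum(EXPECTED_ROWS.values()))  # 2540044, matching the stated total

# Tiny made-up stand-in for a real file.
print(count_rows(io.StringIO("a,b\nc,d\ne,f\n")))  # 3
```

For a downloaded file, `count_rows(open(name, newline=""))` could be compared with `EXPECTED_ROWS[name]`.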

The list-of-events file is named UNSW-NB15_LIST_EVENTS.csv and contains the attack category and subcategory.

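Accuracy Analysis on UNSW-NB15

This section looks at the accuracy that various algorithms achieve on the UNSW-NB15 dataset. The results come from the following paper.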

@article{moustafa2016evaluation,
title={The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set},
author={Moustafa, Nour and Slay, Jill},
journal={Information Security Journal: A Global Perspective},
volume={25},
number={1-3},
pages={18--31},
year={2016},
publisher={Taylor \& Francis}
}

Five algorithms are used in the evaluation: Naive Bayes (NB) (Panda & Patra, 2007), Decision Tree (DT) (Bouzida & Cuppens, 2006), Artificial Neural Network (ANN) (Bouzida & Cuppens, 2006; Mukkamala et al., 2005), Logistic Regression (LR) (Mukkamala et al., 2005), and Expectation-Maximization (EM) Clustering (Sharif et al., 2012).

The models are evaluated by accuracy and false alarm rate (FAR). For more on evaluation metrics, see: Model Evaluation Metrics Explained and Applied: The Confusion Matrix
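Both metrics can be computed from the four cells of a binary confusion matrix, with attack as the positive class. The sketch below uses the standard accuracy formula and reports the false positive rate and false negative rate separately. How exactly the cited paper combines these into its FAR is not stated in this article, so treating FAR as the false positive rate (a common convention) or as the mean of the two rates is an assumption to verify against the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all records that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp, tn):
    """Fraction of normal records that were wrongly flagged as attacks."""
    return fp / (fp + tn)


def false_negative_rate(fn, tp):
    """Fraction of attack records that were missed."""
    return fn / (fn + tp)


# Made-up confusion matrix for illustration only.
tp, tn, fp, fn = 40, 45, 5, 10
fpr = false_positive_rate(fp, tn)
fnr = false_negative_rate(fn, tp)
print(f"accuracy: {accuracy(tp, tn, fp, fn):.3f}")
print(f"FPR: {fpr:.3f}, FNR: {fnr:.3f}, mean of both: {(fpr + fnr) / 2:.3f}")
```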

The final results reported in the paper are shown in the figure below; the accuracy is a little under 85%: