Intrusion detection and identification based on Supelec TCPdump data and KDD1999 Sylvain GOMBAULT et Wei WANG Département Réseaux, Sécurité et Multimédia École Nationale Supérieure des Télécommunications de Bretagne, France


Page 1: Intrusion detection and identification based on Supelec TCPdump data and KDD1999

Intrusion detection and identification based on Supelec TCPdump data and KDD1999

Sylvain GOMBAULT and Wei WANG

Département Réseaux, Sécurité et Multimédia

École Nationale Supérieure des Télécommunications de Bretagne, France

Page 2: Intrusion detection and identification based on Supelec TCPdump data and KDD1999

- 2 -

GET/ENST Bretagne

Intrusion detection and identification based on Supelec data and KDD1999

DADDi Reunion, Rennes October 11, 2007

Outline

• Deep analysis of kdd99 transformation and database

• Intrusion detection using Supelec TCPdump data

• Building multiple behavioral models for network intrusion identification (Monam 2007)

• kNN based intrusion detection and identification

• PCA based intrusion detection and identification

• Conclusion & future work

Page 3: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Data transformation and explicit approach

• Transformation function
  • Choice of relevant attributes
  • Definition of the properties to satisfy (rich function)

• Two steps after transformation of the raw data
  • Construction of the model by learning from labeled data
  • Detection phase: data to classify (analyze)

[Figure: transformation of the raw traffic into attribute records, followed by classification with a decision tree whose test nodes are "service" (branches domain_u, http, private, time, auth) and "Protocol_type" (branches tcp, udp) and whose leaves are normal, DOS and Probe.]

Example records after transformation of the raw traffic:

duration | service | protocol | class
230s     | http    | tcp      | normal
0s       | private | udp      | DOS

Page 4: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Transformation function

• Data considered:
  • Network traffic

• To feed the classification tool from the raw traffic:
  • Transformation function T
  • R: set of raw traffic
  • I: set of structured items

Page 5: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Analysis of kdd99 database (1)

• Learning base: 4 connections form two pairs; within each pair the 41 attributes are identical but the labels differ

• 0,icmp,ecr_i,SF,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,1,1,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,ipsweep.148774

• 0,icmp,ecr_i,SF,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,1,1,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,portsweep.345836

• 0,icmp,tim_i,SF,564,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,2,2,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,normal.143855

• 0,icmp,tim_i,SF,564,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,2,2,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,pod.345952
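One way such conflicting records could be found is to group the records by their attribute fields and keep the groups that carry more than one label. The sketch below is only an illustration, not the authors' tool: it assumes records in the KDD99 CSV layout (label as the last field, terminated by a dot), and the function name `find_label_conflicts` and the shortened sample records are hypothetical.

```python
from collections import defaultdict

def find_label_conflicts(lines):
    """Return {attribute string: sorted labels} for every attribute
    vector that occurs with more than one distinct label."""
    labels_by_attrs = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Everything before the last comma is the attribute vector,
        # the last field is the label (KDD99 labels end with a dot).
        attrs, _, label = line.rpartition(",")
        labels_by_attrs[attrs].add(label.rstrip("."))
    return {a: sorted(ls) for a, ls in labels_by_attrs.items() if len(ls) > 1}

# Shortened, hypothetical records (real KDD99 records have 41 attributes).
records = [
    "0,icmp,ecr_i,SF,8,ipsweep.",
    "0,icmp,ecr_i,SF,8,portsweep.",
    "0,icmp,tim_i,SF,564,normal.",
    "0,icmp,tim_i,SF,564,pod.",
    "0,tcp,http,SF,230,normal.",
]
conflicts = find_label_conflicts(records)
for attrs, labels in conflicts.items():
    print(attrs, "->", labels)
```

Any classifier that sees only the attributes must assign the same class to both members of such a pair, so at least one of them is necessarily misclassified.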

Page 6: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Analysis of kdd99 database (2)

• Test base (corrected file)

• 71 distinct connection records have the same attributes but different labels.

• 71,503 connections (22.99% of the total) have the same attributes as other connections but appear with different labels.

• 3 ipsweep (Probing) attack connections have the same attributes as the smurf (DoS) attack connections (56,608 connections).

• These 3 Probing attacks (0.07%) cannot be detected as such (they are classified as DoS attacks instead).

Page 7: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Analysis of kdd99 database (3)

• Test base (corrected file):
  • 7,563 connections of the snmpgetattack attack (97.7% of the total) have the same attributes as normal connections
  • The other 2.3% of the snmpgetattack connections have attributes similar to those of normal connections, but not all identical
  • These 7,563 connections (46.72% of all R2L attacks) cannot be detected (they are classified as normal)

Page 8: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Improvements to C4.5 (for kdd99)

Page 9: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


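Supelec TCPdump data

• Supelec TCPdump (raw traffic) -> using BRO to construct attributes

• Transformation of the tcpdump traffic into 41 attributes

[Figure: transformation of the raw traffic by BRO into attribute records, followed by classification with a decision tree whose test nodes are "service" (branches domain_u, http, private, time, auth) and "Protocol_type" (branches tcp, udp) and whose leaves are normal, DOS and Probe.]

Example records after transformation of the raw traffic by BRO:

duration | service | protocol | class
230s     | http    | tcp      | normal
0s       | private | udp      | DOS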

Page 10: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Supelec TCPdump data (continued)

• Transformation of the TCPdump traffic into 41 attributes

• Use BRO

• 4 categories of attributes:

  • General data about the connection (network and transport level)
    • Service, protocol type (TCP, UDP or ICMP), …

  • Attributes related to the application layer
    • Number of file creations, number of shells, …

  • Statistical attributes over the connections that fall within the last 2 seconds before the current connection

  • Statistical attributes over the last 100 connections
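The 2-second statistical attributes can be illustrated with a sliding window over connections sorted by time. This is a minimal sketch under stated assumptions, not the BRO scripts used by the authors: the tuple layout `(timestamp, dst_host, service)` and the function name `window_counts` are hypothetical, and the counts here exclude the current connection, which may differ from the exact KDD99 definition.

```python
from collections import deque

def window_counts(connections, window=2.0):
    """For each connection (timestamp, dst_host, service), count the earlier
    connections within the last `window` seconds that went to the same host
    and that used the same service. Input must be sorted by timestamp."""
    recent = deque()  # connections still inside the time window
    counts = []
    for ts, host, service in connections:
        # Drop connections that are older than the window.
        while recent and ts - recent[0][0] > window:
            recent.popleft()
        same_host = sum(1 for _, h, _ in recent if h == host)
        same_service = sum(1 for _, _, s in recent if s == service)
        counts.append((same_host, same_service))
        recent.append((ts, host, service))
    return counts

# Hypothetical connection summaries.
conns = [
    (0.0, "A", "http"),
    (0.5, "A", "http"),
    (1.0, "B", "http"),
    (3.5, "A", "dns"),
]
print(window_counts(conns))
```

The same structure with a fixed-length `deque(maxlen=100)` instead of a time limit would give counts over the last 100 connections.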

Page 11: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Learning and test data sets

• Learning base (from KDD99)
  • ~5 million connections, of which the 10% subset (494,021 connections) of the KDD99 learning set is used
  • 4 attack classes + normal traffic
    • Probing (4), DoS (6), U2R (4), R2L (9).

• Test base (from Supelec)
  • Normal
    • Files 0-29 of the 101 tcpdump files are used
    • 30 Gb in size
    • 4,652,059 connections
      • TCP: 1,173,654; UDP: 3,254,160; ICMP: 224,245
    • Only normal data
  • Attack
    • 10 connections
    • Cross-http, write-http, login-http, execute-http

Page 12: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Results with decision trees (C4.5)

• The C4.5 algorithm introduced by Quinlan, with a few modifications
  • Construction process
  • Classification process

Test data        | Normal (%) | Probing (%) | DoS (%) | U2R (%) | R2L (%) | New (%)
Normal (4652059) | 72.3       | 12.9        | 14.6    | 0       | 0       | 0.2
Attack (10)      | 10         | 0           | 0       | 0       | 0       | 0

Page 13: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Intrusion detection and Identification based on KDD99 data

• Building the normal model from normal data for intrusion detection

• Building an individual model for each attack type from the corresponding attack data for intrusion identification

Page 14: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


The general Intrusion detection and Identification Model

Page 15: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


kNN based intrusion detection

• Building the normal behavioral model
  • Calculate the distance between each test vector $t$ and each vector $x_j$ in the training data set, using the Euclidean distance over the $M$ vector components:

    $$dis_{eu}(t, x_j) = \lVert t - x_j \rVert = \sqrt{\sum_{i=1}^{M} (t_i - x_{ji})^2}$$

  • Sort the distances and choose the k nearest neighbors.
  • Average the k smallest distances to obtain the anomaly index.

• Detection
  • If the anomaly index of a test vector $t$ is above a threshold, the test vector is classified as abnormal.
  • Otherwise it is considered normal.
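The detection steps above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names, the toy training vectors and the threshold value are all assumptions.

```python
import math

def euclidean(t, x):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((ti - xi) ** 2 for ti, xi in zip(t, x)))

def knn_anomaly_index(t, training, k=5):
    """Average distance from t to its k nearest training vectors."""
    nearest = sorted(euclidean(t, x) for x in training)[:k]
    return sum(nearest) / len(nearest)

def is_abnormal(t, training, k, threshold):
    """A vector is abnormal when its anomaly index exceeds the threshold."""
    return knn_anomaly_index(t, training, k) > threshold

# Toy normal data (hypothetical); real vectors would hold the 41 attributes.
normal_training = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(knn_anomaly_index((0, 0), normal_training, k=2))
print(is_abnormal((3, 4), normal_training, k=2, threshold=1.0))
```

Because every test vector is compared with every training vector, all the cost falls in the testing stage.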

Page 16: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


kNN based intrusion identification

Define the normal and individual attack data sets as $D_1, D_2, \ldots, D_l$;

Identification:

For each test vector $t$ do
    Calculate $dis_{eu}(t, x_j)$ for each $x_j$ in each training set;
    Find the k smallest scores of $dis_{eu}(t, x_j)$ as the k nearest neighbors;
    If more than half of the k nearest neighbors correspond to a specific attack type $A_k$ then
        $t$ is identified as $A_k$
    Else If the number of smallest distances that correspond to an attack type $A_p$ is greater than those of the others then
        $t$ is identified as $A_p$
    Else
        $t$ is identified as a new attack
    End If
End For
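The identification procedure above could be written as follows. This is a sketch under assumptions: `training_sets` (a mapping from class name to vectors), the function name `knn_identify` and the toy data are hypothetical, and a tie between the most frequent classes is treated here as the "new attack" case.

```python
import math
from collections import Counter

def euclidean(t, x):
    return math.sqrt(sum((ti - xi) ** 2 for ti, xi in zip(t, x)))

def knn_identify(t, training_sets, k=5):
    """Identify t by voting among its k nearest neighbors over all sets."""
    scored = [(euclidean(t, x), label)
              for label, vectors in training_sets.items()
              for x in vectors]
    if not scored:
        return "new attack"
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k]).most_common()
    top_label, top_count = votes[0]
    if top_count > k / 2:                    # absolute majority
        return top_label
    if len(votes) == 1 or top_count > votes[1][1]:
        return top_label                     # strictly more votes than any other
    return "new attack"                      # no class stands out

# Toy data sets (hypothetical values).
sets = {
    "normal": [(0, 0), (0, 1)],
    "smurf": [(5, 5), (5, 6), (6, 5)],
    "pod": [(10, 0), (10, 1)],
}
print(knn_identify((5.2, 5.1), sets, k=3))
```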

Page 17: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Principal Component Analysis

• Dimension reduction technique for data analysis and compression
  • New coordinate system to represent the original large data set
  • The axes are the eigenvectors associated with the several largest eigenvalues
  • without sacrificing valuable information in the data set

• Has been applied in face recognition, text categorization, etc.

[Figure: original coordinate axes (x, y) and new coordinate axes (u, v)]

PCA methods for intrusion detection

Page 18: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


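PCA based normal model building for intrusion detection

• Training data (attribute matrix): $X_{n \times m}$, with the $m$ training vectors $x_1, \ldots, x_m$ of $n$ attributes each

• Mean vector: $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$

• Mean-adjusted matrix ($n \times m$): $\Phi_i = x_i - \mu$

• Covariance matrix: $C = \frac{1}{m} \sum_{i=1}^{m} \Phi_i \Phi_i^T$

• Eigenvalue-eigenvector pairs: $(\lambda_1, u_1), (\lambda_2, u_2), \ldots, (\lambda_n, u_n)$

• $U_{n \times k}$ ($k \ll n$): the k eigenvectors associated with the k largest eigenvalues, with k chosen such that

  $$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \geq \alpha$$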
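The model-building steps (mean vector, covariance matrix, leading eigenvectors chosen by the ratio α) can be sketched without external libraries. This is an illustration only: power iteration with deflation is an assumed stand-in for whatever eigen-solver the authors used, and the function names and toy data are hypothetical. The sum of all eigenvalues is taken from the trace of the covariance matrix, so the ratio can be checked without computing every eigenpair.

```python
import math

def mean_vector(xs):
    """mu = (1/m) * sum of the m training vectors."""
    m = len(xs)
    return [sum(x[d] for x in xs) / m for d in range(len(xs[0]))]

def covariance(xs, mu):
    """C = (1/m) * sum_i phi_i phi_i^T with phi_i = x_i - mu."""
    n, m = len(mu), len(xs)
    phis = [[x[d] - mu[d] for d in range(n)] for x in xs]
    return [[sum(p[a] * p[b] for p in phis) / m for b in range(n)]
            for a in range(n)]

def top_eigenpair(c, iters=500):
    """Power iteration on a symmetric matrix. Assumes the start vector
    is not orthogonal to the leading eigenvector."""
    n = len(c)
    v = [1.0 / (d + 1) for d in range(n)]  # arbitrary non-zero start
    for _ in range(iters):
        w = [sum(c[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < 1e-12:  # no variance left
            return 0.0, v
        v = [wi / norm for wi in w]
    cv = [sum(c[a][b] * v[b] for b in range(n)) for a in range(n)]
    return sum(v[a] * cv[a] for a in range(n)), v  # Rayleigh quotient

def build_model(xs, alpha=0.95):
    """Return (mu, basis); basis holds the k leading unit eigenvectors."""
    mu = mean_vector(xs)
    c = covariance(xs, mu)
    n = len(mu)
    total = sum(c[d][d] for d in range(n))  # trace = sum of eigenvalues
    basis, kept = [], 0.0
    while len(basis) < n and total > 0 and kept / total < alpha:
        lam, u = top_eigenpair(c)
        if lam <= 1e-12:
            break
        basis.append(u)
        kept += lam
        # Deflation: remove the component just found.
        c = [[c[a][b] - lam * u[a] * u[b] for b in range(n)]
             for a in range(n)]
    return mu, basis

# Toy training data lying on the line y = x (hypothetical values).
training = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]
mu, basis = build_model(training)
print(mu, basis)
```

For the toy data a single eigenvector along the diagonal already explains all the variance, so k = 1.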

Page 19: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


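Intrusion detection based on PCA model

• Test data: $t$

• Mean vector: $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$

• Mean-adjusted test vector: $\Phi = t - \mu$

• Projection: $y = U^T \Phi$, with $U_{n \times k}$ ($k \ll n$); $y$ holds the projection coefficients (principal components)

• Reconstruction: $f = U y$

• Anomaly/identification index: $\varepsilon = \lVert \Phi - f \rVert$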
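The projection and reconstruction step amounts to measuring how far the mean-adjusted test vector lies from the subspace spanned by the retained eigenvectors. A minimal sketch, assuming the mean vector and an orthonormal basis are already available; the function name `reconstruction_error` and the toy values are hypothetical.

```python
import math

def reconstruction_error(t, mu, basis):
    """Distance between the mean-adjusted vector and its reconstruction
    from the subspace spanned by `basis` (a list of unit eigenvectors,
    i.e. the columns of U)."""
    n = len(mu)
    phi = [t[d] - mu[d] for d in range(n)]                      # phi = t - mu
    y = [sum(u[d] * phi[d] for d in range(n)) for u in basis]   # y = U^T phi
    f = [sum(y[j] * basis[j][d] for j in range(len(basis)))     # f = U y
         for d in range(n)]
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(phi, f)))

# Toy model: data centered at the origin, one principal axis along y = x.
mu = [0.0, 0.0]
basis = [[1 / math.sqrt(2), 1 / math.sqrt(2)]]
print(reconstruction_error((2.0, 2.0), mu, basis))   # on the axis: close to 0
print(reconstruction_error((1.0, -1.0), mu, basis))  # off the axis: larger
```

A vector that follows the learned behavior is reconstructed almost exactly, while a vector that deviates from it keeps a large residual.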

Page 20: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


PCA based intrusion detection and intrusion identification

• Intrusion detection
  • Given a new data vector t, if its anomaly index ε is above a threshold, the test vector is considered abnormal
  • Otherwise, it is classified as normal

• Intrusion identification
  • Calculate the Euclidean distance between the test vector and its reconstruction onto each subspace formed by normal data and by each individual type of attack, and set the minimum εi as the identification index.
  • If εi is below the predefined threshold θi for a certain individual type of attack, the vector is identified as this type of attack.
  • Otherwise it is identified as a new attack.
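The identification rule can be sketched as a comparison of the smallest reconstruction error with the threshold of the corresponding class. The sketch assumes the per-class errors εi have already been computed; the function name `pca_identify`, the class names and all numeric values are hypothetical.

```python
def pca_identify(errors, thresholds):
    """errors: class name -> reconstruction error eps_i for the test vector.
    thresholds: class name -> predefined threshold theta_i.
    Returns the class with the smallest error if that error is below the
    class threshold, otherwise 'new attack'."""
    best = min(errors, key=errors.get)
    if errors[best] < thresholds[best]:
        return best
    return "new attack"

# Hypothetical errors of one test vector against three subspaces.
thresholds = {"normal": 0.5, "smurf": 0.5, "neptune": 0.5}
print(pca_identify({"normal": 2.4, "smurf": 0.3, "neptune": 1.9}, thresholds))
print(pca_identify({"normal": 2.4, "smurf": 1.3, "neptune": 1.9}, thresholds))
```

Only one projection per class model is needed for each test vector, which is what keeps the testing stage light.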

Page 21: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Learning and test data sets for intrusion identification

• Data description:
  • 41 attributes + name of the class
  • Text format

• Data for intrusion detection (learning base of kdd99)
  • Learning data: 7,000 randomly selected connections
  • Test data: 4 attack classes + normal traffic
    • Normal data: 10,000 randomly selected normal connections
    • Attack data: all the other attack connections
      • 391,458 DoS attacks, 1,126 R2L attacks, 52 U2R attacks and 4,107 Probe attacks.

• Data for intrusion identification (learning base of kdd99)
  • Learning data:
    • 7,000 randomly selected normal network connections
    • The first 2,000 Back, 10,000 Neptune, 200 Pod, 20,000 Smurf, 800 Teardrop, 40 Guess passwd, 900 Warezclient, 1,000 Ipsweep, 900 Portsweep, 1,200 Satan, 200 Nmap, 15 Warezmaster and 25 buffer overflow attack connections
  • Test data:
    • All the other network connections of these types of attacks are used for identification.

Page 22: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Intrusion detection: results based on PCA and kNN for kdd99 data

DR: detection rate; FPR: false positive rate.

Methods   | Overall DR(%) | Overall FPR(%) | DoS DR(%) | DoS FPR(%) | R2L DR(%) | R2L FPR(%) | U2R DR(%) | U2R FPR(%) | Probe DR(%) | Probe FPR(%)
kNN (k=5) | 84.3          | 2.9            | 87.1      | 2.9        | 37.6      | 1.6        | 75        | 4.1        | 56.4        | 18.6
PCA       | 98.8          | 0.4            | 99.2      | 0.2        | 94.5      | 4          | 88.5      | 0.6        | 80.7        | 4
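For reference, the two measures in the table are usually computed as follows: the detection rate is the share of attack connections flagged as abnormal, and the false positive rate is the share of normal connections flagged as abnormal. The sketch below assumes these standard definitions; the function name and the toy labels are hypothetical and unrelated to the reported results.

```python
def detection_rates(is_attack, flagged):
    """Return (DR, FPR) in percent.
    is_attack[i] is True when connection i really is an attack,
    flagged[i] is True when the detector marked it as abnormal.
    Assumes at least one attack and one normal connection."""
    attacks = sum(1 for a in is_attack if a)
    normals = len(is_attack) - attacks
    detected = sum(1 for a, f in zip(is_attack, flagged) if a and f)
    false_alarms = sum(1 for a, f in zip(is_attack, flagged) if not a and f)
    return 100.0 * detected / attacks, 100.0 * false_alarms / normals

# Toy example: 4 attacks (3 detected) and 5 normal connections (1 false alarm).
truth = [True] * 4 + [False] * 5
flags = [True, True, True, False, True, False, False, False, False]
dr, fpr = detection_rates(truth, flags)
print(dr, fpr)
```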

Page 23: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Intrusion identification: results based on PCA and kNN for kdd99 data

Identification rate (%)

Attack type     | Attack category | kNN (k=5) | kNN (k=7) | kNN (k=9) | PCA
guess_passwd    | R2L             | 92.3      | 92.3      | 92.3      | 92.3
warezclient     | R2L             | 100       | 100       | 100       | 57.5
warezmaster     | R2L             | 80        | 80        | 80        | 100
back            | DoS             | 98.5      | 99.5      | 98        | 100
neptune         | DoS             | 99.8      | 99.8      | 97.7      | 95.3
pod             | DoS             | 100       | 96.9      | 100       | 95.3
smurf           | DoS             | 100       | 100       | 100       | 80.5
teardrop        | DoS             | 97.7      | 99.4      | 97.8      | 100
buffer overflow | U2R             | 80        | 80        | 80        | 60
ipsweep         | Probe           | 97.6      | 99.2      | 97.6      | 6.1
nmap            | Probe           | 12.9      | 12.9      | 12.9      | 67.1
portsweep       | Probe           | 100       | 100       | 100       | 0
satan           | Probe           | 88.2      | 88.2      | 88.2      | 91.5

Page 24: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


kNN and PCA methods comparison

• kNN
  • No need for training
  • Suitable for dynamic environments
  • Requires heavy computation in the testing stage
  • Computation needed (m: dimensionality of the vectors; n: number of samples)

• PCA
  • Needs considerable computation for training
  • Lightweight in the testing stage
  • Computation needed (p: number of different attack types; q: number of principal components)
  • Suitable for detection on massive data

Computation: kNN $O(n^2 m)$; PCA $O(mqp)$

Page 25: Intrusion detection and identification based on Supelec TCPdump data and KDD1999


Conclusion

• The KDD 99 transformation function did not extract enough information from the raw data for anomaly detection
  • With the 41 attributes, only 72% of the Supelec normal data is recognized as normal

• kNN and PCA achieve good detection and identification results on kdd99 data
  • PCA can process massive data sets
  • The identification process needs an attack data set (which is sometimes difficult to obtain)

• The 41 attributes may be reduced for lightweight detection while maintaining the detection accuracy
  • Using optimization methods to select the key attributes is future work

• Early and fast detection of network attacks is important
  • There is no need to wait until the connection is finished; early detection is our future work


Page 26: Intrusion detection and identification based on Supelec TCPdump data and KDD1999

Thank you for your attention!

Questions?