The KDDCup99 dataset is one of the most popular datasets used in network intrusion detection research. In this article, we will explore how to fetch and process the KDDCup99 dataset using the Scikit-Learn library in Python. Fetching datasets, preprocessing, and understanding their structure are critical steps in any machine learning project.
Fetching the KDDCup99 Dataset
Scikit-Learn provides a simple utility to fetch this dataset, downloading it from an online repository the first time and caching it locally. To begin, make sure you have Scikit-Learn installed in your Python environment. You can install it via pip if you haven't done so:
!pip install scikit-learn
Once Scikit-Learn is set up, you can fetch the dataset:
from sklearn.datasets import fetch_kddcup99
# Fetch the KDDCup99 dataset
kddcup99 = fetch_kddcup99()
This function fetches the dataset from an online repository if it is not already present locally and returns a Bunch object with two main attributes: data and target. By default it loads the commonly used 10% subset of the full dataset.
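fetch_kddcup99 also accepts keyword options, such as subset (to load a named variant like 'SA' or 'SF') and percent10 (set it to False to load the full dataset). One way to list the available options without triggering a download is to inspect the loader's signature:

```python
import inspect
from sklearn.datasets import fetch_kddcup99

# List the loader's keyword options without downloading anything
params = inspect.signature(fetch_kddcup99).parameters
print(sorted(params))
```

The printed names include subset, percent10, shuffle, random_state, and return_X_y, which you can look up in the Scikit-Learn documentation.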
Exploring the Dataset
The data attribute contains the features of the dataset, while the target attribute holds the target labels you’re predicting. Let's take a quick look at the structure and content of these attributes:
# Display the shape of the data and target
print("Data shape:", kddcup99.data.shape)
print("Target shape:", kddcup99.target.shape)
# Display the first few entries
data_sample = kddcup99.data[:5]
target_sample = kddcup99.target[:5]
print("Data Sample:\n", data_sample)
print("Target Sample:\n", target_sample)
When exploring large datasets, it's crucial to have a sense of the different features' types and their potential preprocessing needs.
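In KDDCup99, a handful of the 41 features (such as protocol_type, service, and flag) are categorical, and the loader returns them as byte strings while the rest are numeric. As a sketch, using made-up rows in the same mixed layout the loader returns, you can detect which columns will need encoding by checking their Python type:

```python
import numpy as np

# Hypothetical rows in the KDDCup99 layout: numeric fields mixed with
# byte-string categoricals, stored in an object array as the loader returns them
sample = np.array([[0, b'tcp', b'http', b'SF', 181],
                   [0, b'udp', b'domain_u', b'SF', 105]], dtype=object)

# Columns holding byte strings are the ones that need categorical encoding
categorical_cols = [i for i in range(sample.shape[1])
                    if isinstance(sample[0, i], bytes)]
print(categorical_cols)
```

This kind of quick type check tells you exactly which columns the preprocessing step below must encode.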
Preprocessing the Dataset
The KDDCup99 dataset includes categorical data, requiring some preprocessing before being used for a machine learning model. Here is a way to proceed:
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
# Drop columns that are all zeros (in this dataset, 'num_outbound_cmds')
col_deleted_indexes = np.where(np.all(kddcup99.data == 0, axis=0))
X = np.delete(kddcup99.data, col_deleted_indexes, axis=1)
y = kddcup99.target
# Encode the categorical (byte-string) columns with LabelEncoder
def encode_categoricals(X):
    encoders = []
    X_encoded = X.copy()
    for col_idx in range(X.shape[1]):
        if isinstance(X[0, col_idx], bytes):
            encoder = LabelEncoder()
            X_encoded[:, col_idx] = encoder.fit_transform(X[:, col_idx])
            encoders.append(encoder)
        else:
            # Numeric columns are left untouched
            encoders.append(None)
    return X_encoded, encoders

X_encoded, encoders = encode_categoricals(X)

# Normalize all features to the [0, 1] range via MinMaxScaler
scaler = MinMaxScaler()
X_preprocessed = scaler.fit_transform(X_encoded.astype(float))
This preprocessing involves two main steps: encoding categorical data into numerical values using LabelEncoder, and scaling the numerical data between 0 and 1 with MinMaxScaler. Such preprocessing is critical, as most machine learning models work best with numerical and normalized data.
Training a Model
With the KDDCup99 dataset preprocessed, you are ready to train a machine learning model. Let's train a simple Logistic Regression model using Scikit-Learn:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Split the preprocessed dataset
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)
# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate the model
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
This code snippet splits the preprocessed dataset into training and test sets. After training the Logistic Regression model, we report its classification performance. Adjusting model parameters, experimenting with different algorithms, creating ensemble models, or fine-tuning preprocessing can help improve classification performance.
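One simplification worth trying is collapsing the many attack categories into a single binary normal-vs-attack target, which is often how intrusion detection is framed. A sketch on a handful of made-up labels (fetch_kddcup99 returns labels as byte strings such as b'normal.'):

```python
import numpy as np

# Hypothetical sample of KDDCup99-style labels, as byte strings
y = np.array([b'normal.', b'smurf.', b'neptune.', b'normal.', b'back.'])

# Collapse every attack category into a single 'attack' class
y_binary = np.where(y == b'normal.', 'normal', 'attack')
print(y_binary)
```

Training the same Logistic Regression pipeline on y_binary instead of y turns the problem into binary classification, which is simpler to evaluate and often more robust to the rare attack classes.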
Now, you should have a good understanding of how to fetch, process, and utilize the KDDCup99 dataset with Scikit-Learn to train and evaluate a basic model. Use these foundations to build a more comprehensive intrusion detection system.