Sling Academy

Fetching and Processing the KDDCup99 Dataset in Scikit-Learn

Last updated: December 17, 2024

The KDDCup99 dataset is one of the most popular datasets used in network intrusion detection research. In this article, we will explore how to fetch and process the KDDCup99 dataset using the Scikit-Learn library in Python. Fetching a dataset, understanding its structure, and preprocessing it are critical steps in any machine learning project.

Fetching the KDDCup99 Dataset

Scikit-Learn provides a simple utility, fetch_kddcup99, that downloads the dataset from its original repository and caches it locally. To begin, make sure you have Scikit-Learn installed in your Python environment. You can install it via pip if you haven't done so:

pip install scikit-learn

Once Scikit-Learn is set up, you can fetch the dataset:

from sklearn.datasets import fetch_kddcup99

# Fetch the KDDCup99 dataset
kddcup99 = fetch_kddcup99()

This function downloads the dataset the first time it is called and caches it locally for subsequent runs. By default, only a 10% subsample of the full dataset is loaded. The returned Bunch object contains two main attributes: data and target.
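A few options on fetch_kddcup99 are worth knowing about: percent10 (True by default) controls whether the 10% subsample is used, subset selects a task-specific slice such as 'SA', 'SF', 'http', or 'smtp', and return_X_y returns the features and labels directly as a tuple. The calls below are left commented out so they don't trigger a fresh download:

```python
from sklearn.datasets import fetch_kddcup99

# 10% subsample (the default), returned as a Bunch object:
# kddcup99 = fetch_kddcup99(percent10=True)

# The smaller 'SA' subset, returned directly as (data, target):
# X, y = fetch_kddcup99(subset="SA", return_X_y=True)
```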

Exploring the Dataset

The data attribute contains the features of the dataset, while the target attribute holds the target labels you’re predicting. Let's take a quick look at the structure and content of these attributes:

# Display the shape of the data and target
print("Data shape:", kddcup99.data.shape)
print("Target shape:", kddcup99.target.shape)

# Display the first few entries
data_sample = kddcup99.data[:5]
target_sample = kddcup99.target[:5]
print("Data Sample:\n", data_sample)
print("Target Sample:\n", target_sample)

When exploring large datasets, it's important to get a sense of each feature's type and its likely preprocessing needs. Note that the fetched arrays store values as Python bytes objects (e.g. b'tcp', b'normal.'), which is worth keeping in mind during preprocessing.
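To see what working with those bytes labels looks like, here is a tiny synthetic stand-in for kddcup99.target (the label values are illustrative, not drawn from the real download), decoded to ordinary strings and tallied:

```python
import numpy as np
from collections import Counter

# Synthetic stand-in for kddcup99.target, which stores labels
# as bytes objects such as b'normal.' and b'smurf.':
target_sample = np.array([b"normal.", b"smurf.", b"normal.", b"neptune."], dtype=object)

# Decode to ordinary strings and count the class distribution:
decoded = [label.decode("utf-8") for label in target_sample]
print(Counter(decoded))  # Counter({'normal.': 2, 'smurf.': 1, 'neptune.': 1})
```

The same decode-and-count pattern applied to the full target array gives a quick view of how imbalanced the attack classes are.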

Preprocessing the Dataset

The KDDCup99 dataset includes categorical data, which requires some preprocessing before it can be used with a machine learning model. Here is one way to proceed:

import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# The 'num_outbound_cmds' feature is zero for every record, so find
# and drop any columns that contain only zeros:
col_deleted_indexes = np.where(np.all(kddcup99.data == 0, axis=0))

X = np.delete(kddcup99.data, col_deleted_indexes, axis=1)
y = kddcup99.target

# Encode each column as integers
def encode_categoricals(X):
    encoders = []
    X_encoded = X.copy()
    for col_idx in range(X.shape[1]):
        encoder = LabelEncoder()
        # Every column arrives here as strings, so each one is
        # label-encoded; keep the fitted encoder for inverse lookups.
        X_encoded[:, col_idx] = encoder.fit_transform(X[:, col_idx])
        encoders.append(encoder)
    return X_encoded.astype(float), encoders

# Encode, then scale all features to the [0, 1] range
X_encoded, encoders = encode_categoricals(X.astype(str))
scaler = MinMaxScaler()
X_preprocessed = scaler.fit_transform(X_encoded)

This preprocessing involves two main steps: encoding categorical data into numerical values using LabelEncoder, and scaling all features to the range 0 to 1 with MinMaxScaler. Such preprocessing is critical, as most machine learning models expect numerical, similarly scaled inputs. Note that LabelEncoder is primarily intended for target labels; for feature columns, OrdinalEncoder or OneHotEncoder are generally the better fit, but LabelEncoder keeps this walkthrough simple.
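To make the scaling step concrete, MinMaxScaler rescales each column independently as (x - min) / (max - min). A toy illustration on a single column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Each column is rescaled as (x - min) / (max - min),
# so 1 -> 0.0, 3 -> 0.5, and 5 -> 1.0 here:
X_toy = np.array([[1.0], [3.0], [5.0]])
print(MinMaxScaler().fit_transform(X_toy).ravel())  # [0.  0.5 1. ]
```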

Training a Model

With the KDDCup99 dataset preprocessed, you are ready to train a machine learning model. Let's train a simple Logistic Regression model using Scikit-Learn:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Split the preprocessed dataset
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

This code snippet divides the preprocessed dataset into training and testing segments. After training the Logistic Regression model, we report its classification performance with classification_report. Adjusting model parameters, experimenting with different algorithms, building ensemble models, or fine-tuning the preprocessing can all help improve classification performance.
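As one direction for fine-tuning the preprocessing, categorical features can be one-hot encoded inside a Pipeline rather than label-encoded. The sketch below uses a small synthetic stand-in for KDDCup99-style rows (a protocol column plus two numeric columns); the column indices and values are illustrative, not taken from the real dataset:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Synthetic KDDCup99-style rows: one categorical protocol column
# followed by two numeric columns (illustrative values only).
X = np.array([
    ["tcp", 0.0, 181.0],
    ["udp", 2.0, 239.0],
    ["icmp", 0.0, 235.0],
    ["tcp", 1.0, 219.0],
], dtype=object)
y = np.array(["normal.", "normal.", "smurf.", "normal."])

# One-hot encode the categorical column, min-max scale the numeric ones:
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), [0]),
    ("num", MinMaxScaler(), [1, 2]),
])

clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X, y)
print(clf.predict(X))
```

Bundling preprocessing and model in one Pipeline also ensures the encoders are fitted only on the training split, avoiding leakage into the test set.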

Now, you should have a good understanding of how to fetch, process, and utilize the KDDCup99 dataset with Scikit-Learn to train and evaluate a basic model. Use these foundations to build a more comprehensive intrusion detection system.
