
Linear Discriminant Analysis (LDA) with Scikit-Learn

Last updated: December 17, 2024

Linear Discriminant Analysis (LDA) is a method used in statistics and machine learning for dimensionality reduction. While similar in spirit to Principal Component Analysis (PCA), LDA is supervised: it takes the target classes into account, seeking linear combinations of predictors that best separate the classes.

In this guide, we will walk through using LDA with Python's Scikit-Learn library. We will start by understanding the basic concepts, then proceed to a practical application.

Understanding LDA

LDA seeks to reduce the dimensionality of the feature space while preserving class-discriminatory information. It does this by maximizing the ratio of between-class variance to within-class variance in the projected space, so that the classes end up as well separated as possible.

Main Steps in LDA:

  • Compute the mean vector of each class.
  • Compute the within-class and between-class scatter matrices.
  • Solve the generalized eigenvalue problem for the two scatter matrices to obtain eigenvalues and eigenvectors.
  • Choose the top k eigenvectors (those with the largest eigenvalues) to define the lower-dimensional space.
  • Construct the transformation matrix from these eigenvectors.
  • Project the samples into the new space.
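The steps above can be sketched directly with NumPy. This is a minimal illustration on the Iris dataset, not a production implementation; variable names such as S_W, S_B, and W are our own, and for simplicity we invert the within-class scatter matrix directly (Scikit-Learn's solvers are more numerically robust):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Step 1: mean vector of each class, plus the overall mean
means = np.array([X[y == c].mean(axis=0) for c in classes])
overall_mean = X.mean(axis=0)

# Step 2: within-class (S_W) and between-class (S_B) scatter matrices
n_features = X.shape[1]
S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
for c, mean_c in zip(classes, means):
    X_c = X[y == c]
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)

# Step 3: eigenvalues/eigenvectors of S_W^{-1} S_B
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# Steps 4-5: keep the k=2 eigenvectors with the largest eigenvalues
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real

# Step 6: project the samples into the new 2D space
X_lda = X @ W
print(X_lda.shape)

The 150 four-dimensional samples are reduced to two discriminant components, which is what Scikit-Learn's LinearDiscriminantAnalysis does for us in the sections below.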

Using LDA with Scikit-Learn

Installing Scikit-Learn

Make sure you have Scikit-Learn installed. If not, install it using:

pip install scikit-learn

Step-by-Step Implementation

Let's dive into implementing LDA using Scikit-Learn. For this example, we'll use the famous Iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Apply LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
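After fitting, you can also check how much between-class variance each discriminant component captures via the fitted estimator's explained_variance_ratio_ attribute (available for the default 'svd' solver as well as 'eigen'). A small standalone check:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X, y)

# Fraction of between-class variance captured by each discriminant
print(lda.explained_variance_ratio_)

On Iris, the first discriminant dominates, which is why a 2D (or even 1D) projection still separates the species well.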

Now, we will train a simple model using the transformed data:

from sklearn.linear_model import LogisticRegression

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train_lda, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test_lda)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
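Note that LinearDiscriminantAnalysis is itself a classifier, so training a separate model on the transformed data is optional. A minimal sketch using it end-to-end, with the same split parameters as above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit LDA directly as a classifier and evaluate on the held-out set
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test)}")

Using the transformed features with a separate model, as above, is still useful when you want the dimensionality reduction as a preprocessing step for another estimator.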

Visualizing the Results

One of the advantages of the reduced dimensional space is the ability to visualize it:

import matplotlib.pyplot as plt
import numpy as np

# Plot the LDA decision boundary and scatterplot
plt.figure(figsize=(8,6))
colors = ['r', 'g', 'b']
markers = ['s', 'x', 'o']

for i, color, marker in zip(np.unique(y_train), colors, markers):
    plt.scatter(x=X_train_lda[y_train == i, 0], y=X_train_lda[y_train == i, 1],
                alpha=0.7, c=color, marker=marker, label=iris.target_names[i])

plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.title('LDA: Iris Train Set')
plt.legend(loc='lower right')
plt.grid()
plt.tight_layout()
plt.show()

You should see a clear separation between the different species in the Iris dataset, demonstrating how LDA can help explore separability between classes visually.

Conclusion

Linear Discriminant Analysis is an essential tool in any data scientist's repertoire. It not only provides a means to reduce dimensionality but also ensures that the reduced feature space retains critical discriminability information. Scikit-Learn makes it straightforward to apply LDA, along with many visualization options for better understanding the results.


Series: Scikit-Learn Tutorials
