Scikit-Learn's `fetch_lfw_people`: An Image Classification Example

Scikit-Learn is a Python library widely used for machine learning tasks. It ships with loaders for several datasets, including some image datasets, for experimental and educational purposes. One of these is the Labeled Faces in the Wild (LFW) dataset. In this article, we'll explore fetch_lfw_people, a Scikit-Learn function that loads this dataset for face recognition tasks. We'll walk through how to load the data, inspect it, reduce its dimensionality, and apply a simple classifier.

Getting Started
Loading the Dataset
Exploring the Data
Dimensionality Reduction
Training a Classifier
Conclusion

Getting Started

Before we dive into the code, ensure you have Scikit-Learn installed in your environment. You can install Scikit-Learn using pip:

pip install scikit-learn

Additionally, you will need matplotlib and numpy for data visualization and array handling respectively:

pip install matplotlib numpy

Loading the Dataset

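The fetch_lfw_people function downloads and loads the LFW dataset with minimal effort. The dataset consists of face images labeled with the name of the person pictured. The first call downloads the data and caches it locally (by default under ~/scikit_learn_data), so later calls are much faster. Here's how you can load it: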

from sklearn.datasets import fetch_lfw_people

# Load the labeled faces in the wild dataset
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4, color=False)

# Get data and labels
X = lfw_people.data  # the images as flattened arrays
y = lfw_people.target  # the labels (integer indices)

This snippet keeps only the people who have at least 70 images in the dataset (min_faces_per_person=70). It also reduces the workload by scaling each image to 40% of its original size (resize=0.4) and loading it in grayscale (color=False).
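The relationship between the `images` and `data` attributes can be sketched without downloading anything. The snippet below uses a random array as a stand-in for `lfw_people.images`. The 50 x 37 image size is an assumption (the default 125 x 94 face crop scaled by `resize=0.4`), and the sample count of 10 is arbitrary.

```python
import numpy as np

# Stand-in for lfw_people.images: 10 fake grayscale images (assumed 50 x 37 pixels).
rng = np.random.default_rng(0)
images = rng.random((10, 50, 37), dtype=np.float32)

# Flattening each image row by row gives the layout of the `data` attribute.
data = images.reshape(len(images), -1)
print(data.shape)  # (10, 1850)

# A flattened row can be reshaped back into its image, e.g. for plotting.
restored = data[0].reshape(50, 37)
print(np.array_equal(restored, images[0]))  # True
```

Each row of `X` is therefore one image with its pixels laid out end to end, which is the form most Scikit-Learn estimators expect.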

Exploring the Data

To understand the dataset structure, let's first check its dimensions and explore some of the images:

n_samples, h, w = lfw_people.images.shape
n_features = X.shape[1]

print("Total dataset size:")
print(f" n_samples: {n_samples}")
print(f" height: {h}")
print(f" width: {w}")
print(f" n_features: {n_features}")

# Visualizing some samples
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for i, ax in enumerate(axes):
    ax.imshow(lfw_people.images[i], cmap='gray')
    ax.set_title(lfw_people.target_names[lfw_people.target[i]])
plt.show()

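This script prints the dimensions and displays the first five samples with their names. The three-dimensional shape (n_samples, h, w), with no color-channel axis, confirms that the images are grayscale, and n_features equals h * w because each image is flattened into a single row of X.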
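It can also help to check how many images each person has, since the classes in a face dataset are rarely the same size. Below is a minimal sketch using hypothetical stand-ins for `lfw_people.target` and `lfw_people.target_names`; with the real data you would pass those attributes instead.

```python
import numpy as np

# Hypothetical stand-ins for lfw_people.target and lfw_people.target_names.
target = np.array([0, 1, 1, 2, 1, 0, 1])
target_names = np.array(["Person A", "Person B", "Person C"])

# Count how many images carry each integer label.
counts = np.bincount(target, minlength=len(target_names))
for name, count in zip(target_names, counts):
    print(f"{name}: {count}")
```

An uneven distribution is one reason to pass class_weight='balanced' to the classifier, as the training code later in this article does.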

Dimensionality Reduction

Working with high-dimensional data such as images can be computationally expensive. Thus, dimensionality reduction techniques like PCA (Principal Component Analysis) are often applied. We'll use PCA to project the dataset onto a space with fewer dimensions.

from sklearn.decomposition import PCA

n_components = 150
print(f"Extracting the top {n_components} eigenfaces")
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X)
X_pca = pca.transform(X)

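The choice of n_components is a trade-off: fewer components make training faster but discard more of the variance in the data. Note that, for brevity, the code above fits PCA on all of X before the train/test split performed in the next section. For an unbiased evaluation, fit PCA on the training portion only, so that no information from the test images influences the projection.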
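One common way to pick the number of components is to look at the cumulative explained variance. The sketch below illustrates the idea on synthetic data. `X_demo`, the 95% threshold, and the array sizes are all assumptions chosen for illustration, not values taken from the LFW dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X: 200 samples, 50 features, most variance in 5 directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Fit PCA with all components and accumulate the variance ratios.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"{n_needed} components explain {cumulative[n_needed - 1]:.1%} of the variance")
```

On the face data, the same calculation on a PCA fitted to X would show how much variance a given n_components, such as 150, retains.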

Training a Classifier

To classify these reduced-dimension face images, we can use a Support Vector Machine (SVM), a versatile and widely used classifier. We'll employ the SVM from Scikit-Learn with a linear kernel for simplicity.

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split the dataset into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.25, random_state=42)

# Create and train an SVM classifier
target_names = lfw_people.target_names
clf = SVC(kernel='linear', class_weight='balanced')
clf.fit(X_train, y_train)

# Evaluate on the test set
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))

The classifier is trained on the PCA-transformed training data and then evaluated on the test data. The classification report gives insights into precision, recall, and F1-score for each class.
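A variation worth considering is to split the raw features first and wrap PCA and the SVM in a single Scikit-Learn pipeline, so that PCA is fitted on the training split only. This is a sketch under stated assumptions rather than the method used above: it runs on synthetic data from make_classification so that it works without downloading LFW, and `X_demo`, `y_demo`, and the choice of 20 components are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for (X, y) so the sketch runs without downloading LFW.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=60, n_informative=10, n_classes=3, random_state=0
)

# Split the raw features first, keeping class proportions similar in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# PCA is fitted inside the pipeline, so it only ever sees the training split.
model = make_pipeline(
    PCA(n_components=20, svd_solver="randomized", whiten=True, random_state=0),
    SVC(kernel="linear", class_weight="balanced"),
)
model.fit(X_tr, y_tr)
print(f"Test accuracy: {model.score(X_te, y_te):.3f}")
```

With the face data, you would pass X and y from fetch_lfw_people in place of the synthetic arrays and use a larger n_components, such as the 150 used earlier.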

Conclusion

Using the fetch_lfw_people function from Scikit-Learn, we can easily access and experiment with the LFW face dataset. By combining PCA for dimensionality reduction with an SVM classifier, we built a compact workflow for face recognition: load the images, inspect them, project them onto a smaller set of components, and train and evaluate a classifier. The same steps provide a foundation for other image classification tasks in Python.
