Scikit-learn's AffinityPropagation is a powerful clustering algorithm that identifies exemplars among the data points and forms clusters of data points around these exemplars. This technique is particularly useful because it does not require specifying the number of clusters before running the algorithm. Let's dive into a step-by-step guide on how to use AffinityPropagation from scikit-learn.
Introduction to Affinity Propagation
Clustering is a common unsupervised machine learning task where the objective is to group similar data points together. Unlike traditional K-Means, Affinity Propagation works by exchanging messages between data points until a high-quality set of exemplars and corresponding clusters emerges. It is well-suited for cases when the number of clusters is not known beforehand.
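To make this concrete, here is a minimal sketch on a hypothetical toy dataset (two tight groups of points, invented for illustration) showing that AffinityPropagation is never told how many clusters to find:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy data: two obvious groups, but we never specify a cluster count.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

af = AffinityPropagation(random_state=0).fit(X)
print(af.labels_)                        # one cluster label per point
print(len(af.cluster_centers_indices_))  # number of clusters discovered
```

Each exemplar is an actual data point from `X`, indexed by `cluster_centers_indices_`; this is a key difference from K-Means, whose centroids are averages rather than real points.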
Setting Up the Environment
Before we get started, make sure you have scikit-learn installed. You can set it up using pip:
pip install scikit-learn

Importing Libraries
First, we import the necessary libraries. Affinity Propagation is available in the cluster module of scikit-learn. Additionally, we'll use numpy for numerical operations and matplotlib for visualization:
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

Generating Sample Data
Let's create a sample dataset with distinct clusters using make_blobs from scikit-learn:
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

Applying Affinity Propagation
Now, we apply the Affinity Propagation algorithm to the dataset:
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50, random_state=0).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print(f'Estimated number of clusters: {n_clusters_}')
print(f'Cluster centers indices: {cluster_centers_indices}')

Evaluating the Clustering
Evaluation is crucial to understand the quality of the clustering. One metric that is commonly used is the Adjusted Rand Index (ARI), which measures the similarity of the clustering to the ground truth:
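ARI requires ground-truth labels, which real datasets often lack. In that case an internal metric such as the silhouette score, which only needs the data and the predicted labels, can be used instead. A minimal self-contained sketch, recreating the dataset and clustering from this guide:

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn import metrics

# Same dataset and settings as in the rest of this guide.
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.5, random_state=0)
labels = AffinityPropagation(preference=-50, random_state=0).fit_predict(X)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
print("Silhouette score:", metrics.silhouette_score(X, labels))
```

Unlike ARI, the silhouette score says nothing about agreement with true labels; it only measures how compact and well separated the discovered clusters are.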
print("Adjusted Rand Index:", metrics.adjusted_rand_score(labels_true, labels))

Visualizing the Clusters
Visualization allows us to better understand the clustering results. We will plot the data points and use different colors for different clusters, highlighting the exemplar data points:
# Plot the results
plt.figure(figsize=(8, 6))
colors = plt.cm.Spectral(np.linspace(0, 1, len(cluster_centers_indices)))
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], '.', color=col)
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Tuning Preferences
The preference parameter sets each point's self-similarity, i.e., how strongly it is considered as a candidate exemplar, and thereby controls the number of clusters. A lower (more negative) preference produces fewer clusters, while a higher preference produces more. Experiment with various preference values to see the effect on the clustering.
Conclusion
In this guide, we walked through the process of implementing and understanding Affinity Propagation using scikit-learn. This method is well suited to exploratory data analysis, where the number of clusters is not known in advance. As with any algorithm, interpretation and evaluation are key, so ensure that the clustering results align with your data exploration goals.