Scikit-learn's AffinityPropagation is a powerful clustering algorithm that identifies exemplars among the data points and forms clusters of data points around these exemplars. This technique is particularly useful because it does not require specifying the number of clusters before running the algorithm. Let's dive into a step-by-step guide on how to use AffinityPropagation from scikit-learn.
Introduction to Affinity Propagation
Clustering is a common unsupervised machine learning task where the objective is to group similar data points together. Unlike traditional K-Means, Affinity Propagation works by exchanging messages between data points until a high-quality set of exemplars and corresponding clusters emerges. It is well-suited for cases when the number of clusters is not known beforehand.
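To make this concrete, here is a minimal sketch on a hypothetical toy dataset (two tight groups of points, invented for illustration) showing that AffinityPropagation is never told how many clusters to find:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy data: two obvious groups, but we never specify a cluster count.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

af = AffinityPropagation(random_state=0).fit(X)
print(af.labels_)                        # one cluster label per point
print(len(af.cluster_centers_indices_))  # number of clusters discovered
```

Each exemplar is an actual data point from `X`, indexed by `cluster_centers_indices_`; this is a key difference from K-Means, whose centroids are averages rather than real points.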
Setting Up the Environment
Before we get started, make sure you have scikit-learn installed. You can set it up using pip:
pip install scikit-learn

Importing Libraries
First, we import the necessary libraries. Affinity Propagation is available in the cluster module of scikit-learn. Additionally, we'll use numpy for numerical operations and matplotlib for visualization:
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

Generating Sample Data
Let's create a sample dataset with distinct clusters using make_blobs from scikit-learn:
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

Applying Affinity Propagation
Now, we apply the Affinity Propagation algorithm to the dataset:
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50, random_state=0).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print(f'Estimated number of clusters: {n_clusters_}')
print(f'Cluster centers indices: {cluster_centers_indices}')

Evaluating the Clustering
Evaluation is crucial to understand the quality of the clustering. One metric that is commonly used is the Adjusted Rand Index (ARI), which measures the similarity of the clustering to the ground truth:
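ARI requires ground-truth labels, which real datasets often lack. In that case an internal metric such as the silhouette score, which only needs the data and the predicted labels, can be used instead. A minimal self-contained sketch, recreating the dataset and clustering from this guide:

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn import metrics

# Same dataset and settings as in the rest of this guide.
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.5, random_state=0)
labels = AffinityPropagation(preference=-50, random_state=0).fit_predict(X)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
print("Silhouette score:", metrics.silhouette_score(X, labels))
```

Unlike ARI, the silhouette score says nothing about agreement with true labels; it only measures how compact and well separated the discovered clusters are.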
print("Adjusted Rand Index:", metrics.adjusted_rand_score(labels_true, labels))

Visualizing the Clusters
Visualization allows us to better understand the clustering results. We will plot the data points and use different colors for different clusters, highlighting the exemplar data points:
# Plot the results
plt.figure(figsize=(8, 6))
colors = plt.cm.Spectral(np.linspace(0, 1, len(cluster_centers_indices)))
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], '.', color=col)
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Tuning Preferences
The preference parameter sets each point's self-similarity, i.e., how strongly it is considered as a candidate exemplar, and thereby controls the number of clusters. A lower (more negative) preference produces fewer clusters, while a higher preference produces more. Experiment with various preference values to see the effect on the clustering.
Conclusion
In this guide, we walked through the process of implementing and understanding Affinity Propagation using scikit-learn. This method is well suited to exploratory data analysis, where the number of clusters is not known in advance. As with any algorithm, interpretation and evaluation are key, so ensure that the clustering results align with your data exploration goals.