SciPy cluster.hierarchy.linkage() function (with examples)

Updated: March 7, 2024 By: Guest Contributor

The scipy.cluster.hierarchy.linkage() function is a powerful tool in the SciPy library, used primarily for hierarchical clustering. Hierarchical clustering is a type of cluster analysis that seeks to build a hierarchy of clusters. In this tutorial, we’ll dive deep into how to use the linkage() function along with practical examples ranging from basic to advanced.

Introduction to Hierarchical Clustering

Hierarchical clustering comes in two main types. Agglomerative (bottom-up) clustering starts with every observation in its own cluster and iteratively merges the closest clusters. Divisive (top-down) clustering starts with all observations in one cluster and successively splits it. The linkage() function performs agglomerative clustering.
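To make the bottom-up idea concrete, here is a minimal sketch of a naive agglomerative pass that uses single linkage (the distance between two clusters is the distance between their closest members). The function name naive_single_linkage is made up for illustration. This is not how SciPy implements linkage(), which uses far more efficient algorithms.

```python
import numpy as np

def naive_single_linkage(points):
    """Return the merge distances of a naive agglomerative (single-linkage) pass."""
    points = np.asarray(points, dtype=float)
    # Every observation starts in its own cluster (a list of row indices).
    clusters = [[i] for i in range(len(points))]
    merge_distances = []
    while len(clusters) > 1:
        best = None  # (distance, position of cluster a, position of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
        merge_distances.append(d)
    return merge_distances

heights = naive_single_linkage([[1, 2], [2, 3], [3, 2], [4, 4]])
print(heights)
```

Each pass through the while loop performs one merge, so n observations always produce n - 1 merges.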

Importing Necessary Libraries

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
import numpy as np

Basic Example

First, let’s start with a simple example. We will create a small dataset and apply the linkage function to it.

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])
Z = linkage(data, 'ward')
plt.figure(figsize=(10, 7))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z)
plt.show()

Output:

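In the dendrogram produced, each merge of the agglomerative clustering is drawn as a horizontal link, and the height of the link is the distance at which the two clusters were merged. The ‘ward’ method merges, at each step, the pair of clusters whose union gives the smallest increase in total within-cluster variance.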
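The dendrogram is drawn from the linkage matrix Z, so it can help to inspect Z directly. The sketch below reuses the same four points and prints each merge. The loop variable names are chosen here for readability.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])
Z = linkage(data, 'ward')

# Z has one row per merge: n - 1 rows for n observations.
# Columns: [cluster index a, cluster index b, merge distance, size of new cluster]
print(Z.shape)
for a, b, dist, size in Z:
    print(f"merge {int(a)} + {int(b)} at distance {dist:.3f} -> {int(size)} points")
```

Indices below n refer to the original observations. An index of n + i refers to the cluster created by row i of Z.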

Understanding Parameters and Methods

The linkage() function accepts various parameters, but the most critical ones are:

  • Method: The method parameter defines how the distance between two clusters is computed, which determines the pair merged at each step. Available methods are ‘single’ (the default), ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’, and ‘ward’. The ‘centroid’, ‘median’, and ‘ward’ methods are correctly defined only for Euclidean distances. You will see several of these methods in the examples below.
  • Metric: The metric parameter defines the distance between observations and is used only when the input is an array of observations rather than precomputed distances. The default is ‘euclidean’; other options include ‘cityblock’ (Manhattan) and ‘cosine’, among others.
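As a small illustration of the two parameters together, the sketch below runs average linkage on the same four points, once with the cityblock metric and once with the default Euclidean metric. The variable names are chosen for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])

# Average linkage on Manhattan (cityblock) distances between observations.
Z_cityblock = linkage(data, method='average', metric='cityblock')

# The same method with the default Euclidean metric, for comparison.
Z_euclidean = linkage(data, method='average')

print(Z_cityblock[:, 2])  # merge distances under cityblock
print(Z_euclidean[:, 2])  # merge distances under euclidean
```

The third column of each result holds the merge distances, so changing the metric changes the heights in the dendrogram and can also change the merge order.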

Using Different Methods

Let’s explore how different methods affect the clustering by applying them to the same dataset.

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])

methods = ['single', 'complete', 'average', 'centroid', 'ward']
for method in methods:
    Z = linkage(data, method)
    plt.figure(figsize=(10, 7))
    plt.title(f'Hierarchical Clustering Dendrogram ({method})')
    plt.xlabel('sample index')
    plt.ylabel('distance')
    dendrogram(Z)
    plt.show()

This produces five dendrograms, one for each method.

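Comparing the dendrograms, every method merges the closest pair of clusters at each step; they differ in how that closeness is measured. The ‘single’ method uses the distance between the two nearest points of two clusters, which tends to produce long, chained clusters. The ‘ward’ method merges the pair that least increases the within-cluster variance, which tends to produce compact clusters of similar size.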
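Besides comparing the plots by eye, one possible numeric comparison is the cophenetic correlation coefficient, which measures how closely the merge heights in a dendrogram track the original pairwise distances. The sketch below computes it for each method with cophenet(). Using this score to compare methods is a suggestion here, not a rule.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])
original_distances = pdist(data)  # condensed Euclidean distances

scores = {}
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(data, method)
    # cophenet returns (correlation, cophenetic distances) when Y is supplied.
    c, _ = cophenet(Z, original_distances)
    scores[method] = c
    print(f"{method:>8}: final merge at {Z[-1, 2]:.3f}, cophenetic corr. {c:.3f}")
```

A value closer to 1 means the hierarchy preserves the original distances more faithfully. On a dataset this small, the differences between methods are not very meaningful.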

Advanced Usage: Custom Distance Metric

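Moving to more advanced usage, we can cluster on distances that we compute ourselves rather than letting linkage() compute them. This is useful for metrics that take extra parameters, such as a Minkowski distance with p=3. To do this, compute the pairwise distances with pdist() and pass the resulting condensed distance matrix to linkage().

from scipy.spatial.distance import pdist

np.random.seed(123)

data_advanced = np.random.rand(10, 2)
# pdist returns the condensed (1-D) distance matrix that linkage() expects
dist_condensed = pdist(data_advanced, 'minkowski', p=3)
# 'average' is used here because 'ward' assumes Euclidean distances
Z = linkage(dist_condensed, 'average')
plt.figure(figsize=(10, 7))
plt.title('Hierarchical Clustering Dendrogram with Custom Metric')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z)
plt.show()

Output:

Note that a precomputed distance matrix must be passed to linkage() in condensed form, which is the 1-D array returned by pdist(). A square matrix is interpreted as a set of observation vectors rather than as distances, so convert it with squareform() first. Also, the ‘centroid’, ‘median’, and ‘ward’ methods are correctly defined only for Euclidean distances, which is why this example uses ‘average’.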
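The difference between the two matrix forms is easy to check. The sketch below also shows that pdist() accepts a Python function as the metric. The weighted_manhattan function is hypothetical and exists only for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(123)
points = rng.random((10, 2))

def weighted_manhattan(u, v):
    # Hypothetical metric for illustration: the second feature counts double.
    return abs(u[0] - v[0]) + 2.0 * abs(u[1] - v[1])

condensed = pdist(points, metric=weighted_manhattan)  # 1-D, length n*(n-1)/2
square = squareform(condensed)                        # 2-D, shape (n, n)

print(condensed.shape, square.shape)

# linkage() expects the condensed vector; squareform() converts a square
# matrix back to condensed form.
Z = linkage(squareform(square), method='average')
```

For 10 points the condensed vector has 45 entries, one per unordered pair. A 2-D array passed to linkage() is treated as observation vectors, so the square matrix should be converted before clustering.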

Combining with Other SciPy Functions

The linkage() function can be effectively combined with other functions from the SciPy library for more comprehensive analyses. For instance, using fcluster() to cut the dendrogram at a specified distance to get cluster labels:

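from scipy.cluster.hierarchy import fcluster
# t is on the same scale as the merge distances in Z[:, 2];
# here we cut the tree at half the height of the final merge
threshold = 0.5 * Z[:, 2].max()
cluster_labels = fcluster(Z, t=threshold, criterion='distance')
print(cluster_labels)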
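If you would rather ask for a number of clusters than choose a distance, fcluster() also supports criterion='maxclust'. The sketch below uses a small made-up dataset with two obvious groups so the result is easy to check.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two visually obvious groups, chosen for illustration.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]])
Z = linkage(points, method='average')

# Ask for at most two flat clusters instead of choosing a distance threshold.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

The first three points share one label and the last three share another. With criterion='maxclust', t is an upper bound on the number of clusters rather than a distance.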

Conclusion

Understanding and utilizing the scipy.cluster.hierarchy.linkage() function can greatly enhance your data analysis and clustering projects. By experimenting with different methods and metrics, as well as with precomputed distance matrices, you can adapt this tool to a wide range of datasets and requirements. This tutorial provided a solid foundation, but practice with varied datasets is key to mastering hierarchical clustering.