SciPy cluster.hierarchy.linkage() function (with examples)

Updated: March 7, 2024 By: Guest Contributor

The scipy.cluster.hierarchy.linkage() function performs agglomerative hierarchical clustering, a type of cluster analysis that builds a hierarchy of clusters. It returns a linkage matrix that records which clusters were merged at each step and at what distance. In this tutorial, we’ll look at how to use linkage(), with practical examples ranging from basic to advanced.

Table Of Contents

1 Introduction to Hierarchical Clustering

2 Importing Necessary Libraries

3 Basic Example

4 Understanding Parameters and Methods

5 Using Different Methods

6 Advanced Usage: Custom Distance Metric

7 Combining with Other SciPy Functions

8 Conclusion

Introduction to Hierarchical Clustering

Hierarchical clustering comes in two main types. Agglomerative (bottom-up) clustering starts with every observation in its own cluster and iteratively merges clusters. Divisive (top-down) clustering starts with all observations in one cluster and successively splits it. The linkage() function uses the agglomerative approach.
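To make the bottom-up idea concrete, here is a minimal sketch of agglomerative clustering written by hand. It is not how SciPy implements linkage(); the function name naive_single_linkage is hypothetical, and the merge rule (smallest distance between any two members, i.e. single linkage) is just one possible choice.

```python
import numpy as np

def naive_single_linkage(points):
    """Toy agglomerative clustering with single linkage (illustration only)."""
    # Every observation starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j], float(d)))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = np.array([[1, 2], [2, 3], [3, 2], [4, 4]], dtype=float)
merges = naive_single_linkage(points)
for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d:.3f}")
```

The loop runs until a single cluster remains, so n observations always produce n - 1 merges. This brute-force version is far slower than linkage() and is only meant to show the merging process.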

Importing Necessary Libraries

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
import numpy as np

Basic Example

First, let’s start with a simple example. We will create a small dataset and apply the linkage function to it.

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])
Z = linkage(data, 'ward')
plt.figure(figsize=(10, 7))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z)
plt.show()

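The resulting dendrogram shows each step of the agglomerative clustering: the leaves are the four samples, and the height of each join is the distance at which the two clusters were merged. At each step, the ‘ward’ method merges the pair of clusters whose union gives the smallest increase in total within-cluster variance.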
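The dendrogram is drawn from the linkage matrix Z, which can also be read directly. The sketch below prints each row of Z for the same four points. Each row holds the indices of the two clusters merged, the merge distance, and the size of the new cluster; indices of n or above refer to clusters created in earlier rows.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])
Z = linkage(data, 'ward')

n = data.shape[0]
# Z has n - 1 rows: [cluster index a, cluster index b, distance, new cluster size]
for step, (a, b, dist, size) in enumerate(Z):
    print(f"step {step}: merge {int(a)} and {int(b)} "
          f"at distance {dist:.3f} -> cluster {n + step} with {int(size)} points")
```

The last row always describes a cluster containing every observation, which is the root of the dendrogram.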

Understanding Parameters and Methods

The linkage() function accepts various parameters, but the most critical ones are:

method: The criterion used to decide which clusters to merge. Common choices include ‘single’ (smallest distance between any two members), ‘complete’ (largest such distance), ‘average’ (mean pairwise distance), ‘centroid’ (distance between cluster centroids), and ‘ward’ (smallest increase in within-cluster variance). The default is ‘single’.
metric: The distance used between observations when the input is an array of observation vectors. Standard metrics include ‘euclidean’ (the default), ‘cityblock’ (Manhattan), and ‘cosine’, among others. The ‘centroid’, ‘median’, and ‘ward’ methods are only correctly defined for Euclidean distances.
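As a small illustration of the two parameters, the sketch below clusters the same four points with average linkage under two different metrics and prints the merge distances (the third column of the linkage matrix). The variable names are only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])

# Same merge criterion, two different distances between observations
Z_euclid = linkage(data, method='average', metric='euclidean')
Z_city = linkage(data, method='average', metric='cityblock')

print("euclidean merge distances:", Z_euclid[:, 2])
print("cityblock merge distances:", Z_city[:, 2])
```

Changing the metric changes the merge heights and can also change the order in which clusters are merged.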

Using Different Methods

Let’s explore how different methods affect the clustering by applying them to the same dataset.

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])

methods = ['single', 'complete', 'average', 'centroid', 'ward']
for method in methods:
    Z = linkage(data, method)
    plt.figure(figsize=(10, 7))
    plt.title(f'Hierarchical Clustering Dendrogram ({method})')
    plt.xlabel('sample index')
    plt.ylabel('distance')
    dendrogram(Z)
    plt.show()

This produces five dendrograms, one per method.

Comparing the dendrograms, the ‘single’ method merges the two clusters whose closest members are nearest to each other, which can chain points together into elongated clusters. The ‘ward’ method instead favors merges that keep within-cluster variance low, which tends to produce compact clusters of similar size.
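The methods can also be compared numerically rather than by eye. One option, sketched below, is the cophenetic correlation coefficient from scipy.cluster.hierarchy.cophenet(), which measures how well the merge heights in each tree preserve the original pairwise distances. Using it as the comparison criterion is a choice made for this example, not the only way to judge a linkage method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])
original = pdist(data)  # condensed Euclidean distances between observations

scores = {}
heights = {}
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(data, method)
    c, _ = cophenet(Z, original)  # correlation and cophenetic distances
    scores[method] = c
    heights[method] = Z[-1, 2]    # height of the final merge
    print(f"{method:>8}: final merge height = {heights[method]:.3f}, "
          f"cophenetic correlation = {c:.3f}")
```

A value closer to 1 means the dendrogram distorts the original distances less. On a dataset this small the differences are only illustrative.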

Advanced Usage: Custom Distance Metric

Moving to more advanced usage, we can cluster on distances of our own choosing instead of the default Euclidean metric. This is done by computing the pairwise distances with pdist() and passing the resulting condensed distance matrix to the linkage() function. The example below uses the Minkowski distance with p=3.

from scipy.spatial.distance import pdist

np.random.seed(123)

data_advanced = np.random.rand(10, 2)
# pdist() returns the condensed distance matrix (1-D, length n*(n-1)/2) that linkage() expects
dist_condensed = pdist(data_advanced, 'minkowski', p=3)
# 'average' is used here because 'ward' is only correctly defined for Euclidean distances
Z = linkage(dist_condensed, 'average')
plt.figure(figsize=(10, 7))
plt.title('Hierarchical Clustering Dendrogram with Custom Metric')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z)
plt.show()

Note that linkage() expects a precomputed distance matrix in condensed form, the 1-D array returned by pdist(). A 2-D array is always interpreted as a set of observation vectors, so a square distance matrix must first be converted to condensed form with squareform().
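A metric that pdist() does not provide by name can be supplied as a Python callable taking two 1-D arrays. The sketch below assumes a made-up metric, weighted_manhattan, chosen only for illustration, and also shows how squareform() converts between the square and condensed forms.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(123)
points = rng.random((10, 2))

def weighted_manhattan(u, v):
    # Hypothetical metric for illustration: the second coordinate counts double.
    return abs(u[0] - v[0]) + 2.0 * abs(u[1] - v[1])

# Condensed form: one entry per pair of observations, n*(n-1)/2 in total
condensed = pdist(points, metric=weighted_manhattan)
print(condensed.shape)

Z = linkage(condensed, method='average')

# squareform() converts condensed -> square and square -> condensed
square = squareform(condensed)
print(square.shape)
print(np.allclose(squareform(square), condensed))
```

If the distances already exist as a square matrix, converting them with squareform() before calling linkage() avoids having the matrix rows treated as observations.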

Combining with Other SciPy Functions

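The linkage() function can be combined with other functions from the SciPy library for more comprehensive analyses. For instance, fcluster() cuts the dendrogram at a specified distance and returns a cluster label for each observation. The threshold t is measured on the same scale as the merge distances in the third column of Z, so it should be chosen by reading the dendrogram's y-axis. Because the points in the previous example lie in the unit square, their merge distances are small, and a threshold far above them would place every point in a single cluster.

from scipy.cluster.hierarchy import fcluster
# Cut the tree at height 0.3; observations joined below this height share a label
cluster_labels = fcluster(Z, t=0.3, criterion='distance')
print(cluster_labels)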
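When the number of clusters is known in advance, fcluster() can be asked for at most that many clusters with criterion='maxclust' instead of a distance threshold. The sketch below uses the four-point dataset from the basic example so that it runs on its own; the threshold 2.0 is picked by hand for this data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1, 2], [2, 3], [3, 2], [4, 4]])
Z = linkage(data, 'ward')

# Ask for at most two flat clusters
labels_k = fcluster(Z, t=2, criterion='maxclust')
print(labels_k)

# Cut at a merge distance of 2.0 instead
labels_d = fcluster(Z, t=2.0, criterion='distance')
print(labels_d)
```

Labels start at 1, and each position in the returned array corresponds to the observation at the same index in the input data.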

Conclusion

Understanding and using the scipy.cluster.hierarchy.linkage() function can greatly enhance your data analysis and clustering projects. By experimenting with different methods and metrics, as well as incorporating custom distance metrics, you can adapt this tool to a wide range of datasets and requirements. This tutorial provides a foundation, but practice with varied datasets is key to mastering hierarchical clustering.
