SciPy cluster.hierarchy.fclusterdata() function (with examples)

Updated: March 7, 2024 By: Guest Contributor

Introduction

In the realm of data science and machine learning, hierarchical clustering is a powerful technique for uncovering the natural grouping within a dataset without prior knowledge of the number of clusters. Python’s SciPy library offers a robust set of tools for hierarchical clustering, including the fclusterdata() function in its cluster.hierarchy module. This tutorial introduces the fclusterdata() function, explains its purpose and parameters, and provides a range of examples from basic to advanced use cases.

Understanding fclusterdata()

The fclusterdata() function provides an easy-to-use interface to perform hierarchical clustering and automatically form clusters based on a threshold criterion. The function primarily takes an array of observations and returns an array of cluster labels. Its signature and primary parameters are as follows:

scipy.cluster.hierarchy.fclusterdata(X, t, criterion='inconsistent', metric='euclidean', depth=2, method='single', R=None)

Where:

  • X is a 2-D array of observations, one row per observation.
  • t is the threshold to apply when forming flat clusters; its meaning depends on criterion (a distance cutoff for ‘distance’, the maximum number of clusters for ‘maxclust’, and so on).
  • criterion specifies the criterion for forming clusters (e.g., ‘maxclust’, ‘inconsistent’, ‘distance’).
  • metric defines the distance metric to use (default is ‘euclidean’).
  • method indicates the linkage method (e.g., ‘single’, ‘complete’, ‘average’, ‘ward’); the default is ‘single’.

Basic Example

Let’s start with a simple example that demonstrates how to use fclusterdata() to perform hierarchical clustering on a small dataset:

from scipy.cluster.hierarchy import fclusterdata
import numpy as np

# Sample dataset
X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])
# Perform hierarchical clustering
cluster_labels = fclusterdata(X, 2, criterion='maxclust')
print(cluster_labels)

Output:

[1 1 1 1 1 2]

In this example, the ‘maxclust’ criterion caps the number of flat clusters at two: the five nearby points all receive label 1, while the outlier [25, 80] is assigned label 2.
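Since the returned labels are just an integer array, it is easy to inspect the resulting grouping. A small follow-up sketch that tallies the cluster sizes from the example above:

```python
from scipy.cluster.hierarchy import fclusterdata
import numpy as np

X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])
labels = fclusterdata(X, 2, criterion='maxclust')

# Count how many observations fall into each cluster label
unique, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique.tolist(), counts.tolist())))  # → {1: 5, 2: 1}
```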

Using Different Criteria and Metrics

Next, we’ll explore how changing the criterion and metric parameters affects the clustering results:

from scipy.cluster.hierarchy import fclusterdata
import numpy as np

X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])

# Using the 'distance' criterion with the Manhattan distance metric
cluster_labels = fclusterdata(X, 5, criterion='distance', metric='cityblock')
print(cluster_labels)

Output:

[1 1 1 2 2 3]

This example demonstrates the impact of changing the distance metric to ‘cityblock’ (Manhattan distance) and using the ‘distance’ criterion.
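With the ‘distance’ criterion the number of clusters is not fixed in advance; it falls out of the threshold t. A quick sketch (thresholds chosen for illustration) that sweeps t over the same dataset shows how clusters merge as the cutoff grows:

```python
from scipy.cluster.hierarchy import fclusterdata
import numpy as np

X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])

# Sweep the distance threshold and count the resulting clusters
n_clusters = []
for t in [2, 10, 90]:
    labels = fclusterdata(X, t, criterion='distance', metric='cityblock')
    n_clusters.append(len(np.unique(labels)))

print(n_clusters)  # → [3, 2, 1]
```

At t=2 only the tight groups merge (three clusters); at t=10 the two small groups join (two clusters); at t=90 even the outlier is absorbed (one cluster).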

Advanced Usage

For more sophisticated analysis, fclusterdata() can be combined with other linkage methods and applied to larger datasets. Here’s an example that applies the ‘ward’ linkage method to a larger, randomly generated dataset:

from scipy.cluster.hierarchy import fclusterdata
import numpy as np

# Generating a larger dataset
X = np.random.randn(100, 2)

# Using the 'ward' linkage method
cluster_labels = fclusterdata(X, 10, criterion='maxclust', metric='euclidean', method='ward')
print(cluster_labels)

Output (will vary, due to the randomness):

[ 6  6  1  2  7  4  6  3  6  2  3  4  4  3  4 10  6  4  3  8  2  3  4  2
  2  3  3  5  5  1  8  4  9  1  4 10  3 10 10  3  4 10  2  4  4 10 10 10
 10  5  6  8  2  1  4  4  3  9  3 10  9  8  1  5  8  3  9  9 10  2  4  5
  5  6  7  9  8  9  4  1  3  4 10  8 10  4  4  8  1 10  5  1  2  2  6  8
  4  6  7  3]

In this scenario, the ‘ward’ linkage method merges, at each step, the pair of clusters that least increases the total within-cluster variance. This tends to produce compact, similarly sized clusters, making it well-suited to dense, roughly spherical groups in complex datasets. Note that ‘ward’ linkage is defined for the Euclidean metric.
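To see ward linkage recover known structure, consider a sketch on synthetic data: two well-separated Gaussian blobs (a hypothetical test dataset generated with NumPy; the blob centers and spread are choices for illustration). With clear separation, each blob should map to a single cluster label:

```python
from scipy.cluster.hierarchy import fclusterdata
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs of 50 points each
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

# Ward linkage with maxclust=2 should separate the blobs cleanly
labels = fclusterdata(X, 2, criterion='maxclust', method='ward')

print(np.unique(labels[:50]), np.unique(labels[50:]))
```

Because the blobs are far apart relative to their spread, the first 50 points share one label and the last 50 share the other.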

Conclusion

The fclusterdata() function in SciPy’s cluster.hierarchy offers a flexible approach to hierarchical clustering, accommodating various criteria, metrics, and methods. Through the examples provided, this tutorial has demonstrated the function’s utility from basic to advanced levels, allowing readers to gain practical insights into its application for diverse clustering needs.