Introduction
In the realm of data science and machine learning, hierarchical clustering is a powerful technique for uncovering the natural grouping within a dataset without prior knowledge of the number of clusters. Python's SciPy library offers a robust set of tools for hierarchical clustering, including the fclusterdata() function in its cluster.hierarchy module. This tutorial introduces the fclusterdata() function, explaining its purpose and parameters and providing a range of examples from basic to advanced use cases.
Understanding fclusterdata()
The fclusterdata() function provides an easy-to-use interface to perform hierarchical clustering and automatically form flat clusters based on a threshold criterion. The function takes an array of observations and returns an array of cluster labels. Its signature and primary parameters are as follows:
scipy.cluster.hierarchy.fclusterdata(X, t, criterion='inconsistent', metric='euclidean', depth=2, method='single', R=None)
Where:
- X: the array of observations to cluster.
- t: the threshold to apply when forming flat clusters.
- criterion: the criterion for forming clusters (e.g., 'maxclust', 'inconsistent', 'distance').
- metric: the distance metric to use (default is 'euclidean').
- method: the linkage method (e.g., 'single', 'complete', 'average'; default is 'single').
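Per SciPy's documentation, fclusterdata() is a convenience wrapper that computes pairwise distances, builds a linkage matrix, and cuts it into flat clusters in one call. The sketch below verifies this equivalence step by step (the dataset reuses the one from the example that follows):

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata, linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])

# One-call convenience wrapper
labels_direct = fclusterdata(X, t=2, criterion='maxclust',
                             metric='euclidean', method='single')

# The same result, assembled manually:
Z = linkage(pdist(X, metric='euclidean'), method='single')
labels_manual = fcluster(Z, t=2, criterion='maxclust')

print(np.array_equal(labels_direct, labels_manual))
```

Working with linkage() directly becomes useful when you want to reuse the linkage matrix, e.g. to plot a dendrogram or cut it at several thresholds without recomputing distances.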
Basic Example
Let’s start with a simple example that demonstrates how to use fclusterdata() to perform hierarchical clustering on a small dataset:
from scipy.cluster.hierarchy import fclusterdata
import numpy as np
# Sample dataset
X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])
# Perform hierarchical clustering
cluster_labels = fclusterdata(X, 2, criterion='maxclust')
print(cluster_labels)
Output:
[1 1 1 1 1 2]
In this example, the dataset is clustered into a maximum of two clusters based on the ‘maxclust’ criterion, revealing the natural grouping within the data.
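The returned labels are 1-based and align with the rows of X, so a boolean mask recovers the members of each cluster. A short sketch using the same dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])
labels = fclusterdata(X, 2, criterion='maxclust')

# Group the original observations by their cluster label
for k in np.unique(labels):
    print(f"cluster {k}: {X[labels == k].tolist()}")
```

Here the distant point [25, 80] ends up alone in cluster 2, while the remaining five observations share cluster 1.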
Using Different Criteria and Metrics
Next, we’ll explore how changing the criterion and metric parameters affects the clustering results:
from scipy.cluster.hierarchy import fclusterdata
import numpy as np
X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])
# Using the 'distance' criterion with the Manhattan distance metric
cluster_labels = fclusterdata(X, 5, criterion='distance', metric='cityblock')
print(cluster_labels)
Output:
[1 1 1 2 2 3]
This example demonstrates the impact of changing the distance metric to ‘cityblock’ (Manhattan distance) and using the ‘distance’ criterion.
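With the 'distance' criterion, the threshold t directly controls granularity: raising it allows more distant groups to merge. A small sketch on the same data (the chosen thresholds are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

X = np.array([[1, 2], [2, 3], [2, 2], [8, 7], [8, 8], [25, 80]])

# Larger thresholds merge clusters separated by larger
# single-linkage cityblock distances, so cluster counts shrink.
for t in (5, 20, 100):
    labels = fclusterdata(X, t, criterion='distance', metric='cityblock')
    print(t, labels, len(np.unique(labels)))
```

At t=5 the three natural groups stay separate; at t=20 the two nearby groups merge while the outlier remains apart; at t=100 everything collapses into a single cluster.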
Advanced Usage
For more sophisticated analysis, fclusterdata() can be combined with other clustering parameters or applied to larger, more complex datasets. Here’s an example that uses the 'ward' linkage method on a larger dataset:
from scipy.cluster.hierarchy import fclusterdata
import numpy as np
# Generating a larger dataset
X = np.random.randn(100, 2)
# Using the 'ward' linkage method
cluster_labels = fclusterdata(X, 10, criterion='maxclust', metric='euclidean', method='ward')
print(cluster_labels)
Output (varies between runs because the data is random):
[ 6 6 1 2 7 4 6 3 6 2 3 4 4 3 4 10 6 4 3 8 2 3 4 2
2 3 3 5 5 1 8 4 9 1 4 10 3 10 10 3 4 10 2 4 4 10 10 10
10 5 6 8 2 1 4 4 3 9 3 10 9 8 1 5 8 3 9 9 10 2 4 5
5 6 7 9 8 9 4 1 3 4 10 8 10 4 4 8 1 10 5 1 2 2 6 8
4 6 7 3]
In this scenario, the 'ward' linkage method merges clusters so as to minimize the increase in within-cluster variance at each step, which tends to produce compact, similarly sized groups and makes it well-suited for more complex datasets.
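To make a run like this reproducible and to inspect the result, you can seed the generator and count the cluster sizes. A sketch (the seed and the use of np.unique with return_counts are illustrative choices, not part of the original example):

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# Seeded generator so the clustering is repeatable across runs
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))

labels = fclusterdata(X, 10, criterion='maxclust',
                      metric='euclidean', method='ward')

# How many observations landed in each cluster?
ids, sizes = np.unique(labels, return_counts=True)
print(dict(zip(ids.tolist(), sizes.tolist())))
```

With 'maxclust', the function forms at most t clusters, so the size table has no more than 10 entries and the sizes sum to the number of observations.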
Conclusion
The fclusterdata() function in SciPy’s cluster.hierarchy module offers a flexible approach to hierarchical clustering, accommodating various criteria, metrics, and linkage methods. Through the examples provided, this tutorial has demonstrated the function’s utility from basic to advanced levels, allowing readers to gain practical insights into its application for diverse clustering needs.