Hierarchical Density-Based Clustering Using HDBSCAN in Scikit-Learn

Last updated: December 17, 2024

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a clustering algorithm that extends DBSCAN by turning it into a hierarchical clustering algorithm and then extracting the most stable clusters from that hierarchy. It is especially useful when the dataset is noisy or when clusters differ in density. Let's walk through how to run HDBSCAN in Python, using the hdbscan package together with Scikit-Learn utilities.

Installation

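Before we start, make sure you have the necessary libraries installed. This tutorial uses the standalone hdbscan package, which is installed separately from Scikit-Learn. (Scikit-Learn 1.3 and later also include their own implementation as sklearn.cluster.HDBSCAN.) You can install the standalone package using pip: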

pip install hdbscan

Ensure you also have NumPy and Matplotlib for data handling and visualization:

pip install numpy matplotlib

Understanding HDBSCAN

HDBSCAN estimates the local density around each point, builds a hierarchy of clusters by progressively raising the density threshold, and then keeps the clusters that persist over the widest range of thresholds. Unlike DBSCAN, it does not need a single global distance threshold (eps), so it can find clusters of varying density. Points that do not belong to any stable cluster are labeled as noise.
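
To make the density idea concrete, here is a rough NumPy sketch of the mutual reachability distance that HDBSCAN builds its hierarchy on. It is an illustration under simplifying assumptions, not the library's implementation: the function name mutual_reachability and the neighbor count k (a stand-in for min_samples) are hypothetical, and real implementations differ in details such as whether a point counts as its own neighbor.

```python
import numpy as np

def mutual_reachability(X, k=5):
    """Return (core_distances, mutual_reachability_matrix) for points X.

    Hypothetical helper for illustration only.
    """
    # Pairwise Euclidean distances, shape (n, n)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Core distance: distance to the k-th nearest neighbor.
    # Column 0 of each sorted row is the point itself (distance 0).
    core = np.sort(dist, axis=1)[:, k]
    # Mutual reachability: max(core(a), core(b), d(a, b))
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    return core, mreach

# A tight group of 30 points plus one far-away point
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(30, 2))
outlier = np.array([[5.0, 5.0]])
X_demo = np.vstack([dense, outlier])

core, mreach = mutual_reachability(X_demo, k=5)
# The isolated point has a much larger core distance than the dense points
print(core[-1], core[:-1].max())
```

Points in sparse regions get large core distances, which pushes them away from everything else in the mutual reachability space. That is what lets the algorithm separate noise from dense clusters.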

Step-by-Step Implementation

Let’s go through a step-by-step approach to implement HDBSCAN:

1. Import Libraries

import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import hdbscan

2. Generate Synthetic Data

We will create a simple, synthetic dataset with clear clusters:

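# Generate synthetic data: three Gaussian blobs around fixed centers
n_samples = 1500
# random_state fixes the seed so the dataset is the same on every run
X, _ = make_blobs(n_samples=n_samples, centers=[[0.2, 2.3], [-1.5, -1.5], [3.0, -2.0]], cluster_std=0.5, random_state=42)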

3. Visualize the Data

It's always a good idea to look at your data. Let's visualize it:

plt.scatter(X[:, 0], X[:, 1], c='gray', marker='o', s=30)
plt.title("Synthetic Data")
plt.show()

4. Apply HDBSCAN

Now we'll apply HDBSCAN to identify clusters in our synthetic data:

# Apply HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=20, min_samples=1)
labels = clusterer.fit_predict(X)

5. Analyze the Result

Each point is assigned a label, and noise is indicated by -1. Let's see the results:

# Visualize clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='plasma', marker='o', s=30)
plt.title("HDBSCAN Clustering")
plt.show()

Understanding the Output

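In the plot, each color corresponds to one cluster label, and points labeled -1 are noise. Because the labels are passed straight to the colormap, the noise points take the color at the low end of the colormap. HDBSCAN is effective here because it can set aside noisy points instead of forcing them into a cluster, which makes the underlying structure easier to see.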
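
A plot is a good first check, but it also helps to count what the algorithm found. The following sketch summarizes a label array with NumPy only. The helper summarize_labels and the example_labels list are hypothetical, made up for this illustration; in practice you would pass the labels array returned by fit_predict.

```python
import numpy as np

def summarize_labels(labels):
    """Count clusters, noise points, and cluster sizes (hypothetical helper)."""
    labels = np.asarray(labels)
    noise_mask = labels == -1  # -1 marks noise
    cluster_ids, counts = np.unique(labels[~noise_mask], return_counts=True)
    return {
        "n_clusters": len(cluster_ids),
        "n_noise": int(noise_mask.sum()),
        "cluster_sizes": dict(zip(cluster_ids.tolist(), counts.tolist())),
    }

# Hand-written labels standing in for the output of fit_predict
example_labels = [0, 0, 1, -1, 1, 1, 2, -1, 0]
summary = summarize_labels(example_labels)
print(summary)
```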

Parameters to Tune

Key parameters in HDBSCAN include:

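  • min_cluster_size: The smallest group of points that HDBSCAN will treat as a cluster. Set it to the smallest cluster size you care about; with larger values, small groups are absorbed into bigger clusters or labeled as noise.
  • min_samples: The number of neighbors used to estimate the density around a point. Higher values make the clustering more conservative, so more points are labeled as noise; lower values are more permissive. If left unset, it defaults to the value of min_cluster_size.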

Practical Applications

HDBSCAN can be applied in various scenarios such as geographical data clustering, market segmentation, and anomaly detection in security logs. Its ability to detect noise makes it highly suitable for real-world data which often contains anomalies or outliers.

Conclusion

HDBSCAN presents a robust framework for clustering when faced with data that exhibits noise and irregular cluster shapes. By leveraging its hierarchical clustering capability, it brings an added depth to the understanding of complex datasets. Integrating HDBSCAN into your data analysis toolkit can offer more adaptability and insight from the clustering process.
