SciPy: Using cluster.vq.vq() function (3 examples)

Updated: March 4, 2024 By: Guest Contributor

Introduction

Cluster analysis, a staple of data science, involves grouping sets of objects in such a way that objects in the same group are more similar to each other than to those in other groups. In this realm, the SciPy library offers powerful tools, one of which is the cluster.vq.vq() function, enabling efficient vector quantization.

Understanding Vector Quantization

Vector Quantization (VQ) is a classical quantization technique from signal processing that models a probability density function by the distribution of a small set of prototype vectors. It maps each vector in a (possibly high-dimensional) space to the nearest member of a finite set of prototype vectors in that same space, called a code book. It has many practical applications in clustering, image compression, and pattern recognition.

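The vq() function in SciPy performs the assignment step: given observation vectors and a code book, it returns the index (code) of the nearest code book entry for each observation, along with the Euclidean distance to that entry. This effectively partitions the observation space.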
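To make the assignment step concrete, here is a minimal sketch with a hand-written code book. The arrays are made up for illustration. The second half recomputes the result with plain NumPy, to show that vq() behaves like a nearest-neighbor lookup under Euclidean distance.

```python
import numpy as np
from scipy.cluster.vq import vq

# Illustrative data: three 2-D observations and a two-entry code book
obs = np.array([[1.0, 1.0], [9.0, 8.5], [1.5, 0.5]])
codebook = np.array([[1.0, 1.0], [9.0, 9.0]])

# vq() returns, per observation, the index of the nearest code book entry
# and the Euclidean distance to it
codes, dist = vq(obs, codebook)

# The same lookup written out with NumPy broadcasting
all_dists = np.linalg.norm(obs[:, None, :] - codebook[None, :, :], axis=2)
manual_codes = all_dists.argmin(axis=1)
manual_dist = all_dists.min(axis=1)

print(codes)         # [0 1 0]
print(manual_codes)  # [0 1 0]
```

Note that vq() only assigns codes; it never changes the code book. Building or refining the code book is the job of kmeans().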

Practical Examples

To work with vq(), ensure you have SciPy installed:

pip install scipy

Example 1: Basic Clustering

Let’s start with a basic example of vector quantization where we have a set of observations and use kmeans() to build the set of centroids, or code book vectors.

import numpy as np
from scipy.cluster.vq import vq, kmeans

# Sample observations
obs = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=np.float32)

# Performing kmeans to generate a code book
codebook, _ = kmeans(obs, 2)

# Assigning codes to observations
codes, dist = vq(obs, codebook)

print("Code assignments:", codes)
print("Distances:", dist)

Output (the labels can differ between runs, because kmeans() starts from randomly chosen centroids):

Code assignments: [0 0 1 1]
Distances: [1.4142135 1.4142135 1.4142135 1.4142135]

In this scenario, kmeans() is used to find the centroids, which serve as the code book. The vq() function then assigns each observation to the nearest centroid, returning both the assignments and the distances.
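If you need the same labels on every run, one option is to pass kmeans() an array of starting centroids instead of a number of clusters. The sketch below assumes hand-picked starting points (initial_guess is a hypothetical name used only here) and uses float64 observations.

```python
import numpy as np
from scipy.cluster.vq import vq, kmeans

obs = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=np.float64)

# Hypothetical starting centroids, chosen by hand for this illustration
initial_guess = np.array([[1.0, 2.0], [7.0, 8.0]])

# With an array as the second argument, kmeans() refines these centroids
# instead of drawing random ones
codebook, distortion = kmeans(obs, initial_guess)
codes, dist = vq(obs, codebook)

print(codebook)  # rows [2. 3.] and [6. 7.]
print(codes)     # [0 0 1 1]
```

Because nothing is drawn at random, the code book rows keep the order of the starting centroids, so label 0 always refers to the cluster that began at [1, 2].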

Example 2: Image Quantization

Image quantization is a common use case for VQ. This example shows how to apply it to reduce the color space of an image.

Before diving into the code, install the scikit-image package, along with Matplotlib for displaying the result:

pip install scikit-image matplotlib

Here’s the code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import vq, kmeans
from skimage import io
from skimage.transform import resize

# Load an RGB image and shrink it to a quarter of its size along each axis
# Remember to replace this with your actual image path
image = io.imread('your-image-here.jpg')
image_resized = resize(image, (image.shape[0] // 4, image.shape[1] // 4),
                       anti_aliasing=True)  # returns floats in [0, 1]

# Flatten the image into one row per pixel (assumes 3 color channels)
pixels = np.reshape(image_resized, (-1, 3))

# Compute k-means with k colors
k = 16  # you can modify this value to increase/decrease color quantization
codebook, _ = kmeans(pixels, k)

# Apply VQ to map each pixel to the index of its nearest codebook color
quantized, dist = vq(pixels, codebook)

# Look up each pixel's codebook color and restore the image shape
quantized_image = np.reshape(codebook[quantized], image_resized.shape)

# Display the quantized image with Matplotlib
# (skimage.io.imshow is deprecated in recent scikit-image releases)
plt.imshow(quantized_image)
plt.axis('off')
plt.show()

In this example, the kmeans() function computes a codebook of up to 16 colors from the pixels of the downsized image. The vq() function then maps each pixel to the nearest color in the codebook, so the rebuilt image contains at most 16 distinct colors.
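To check the effect without an image file, here is a rough sketch that runs the same steps on a synthetic array. The random array stands in for a loaded image, and k = 8 is an arbitrary choice. It counts distinct colors before and after quantization and measures the error the smaller palette introduces.

```python
import numpy as np
from scipy.cluster.vq import vq, kmeans

# Synthetic stand-in for a loaded image: 32x32 RGB, float values in [0, 1)
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
pixels = image.reshape(-1, 3)

k = 8
codebook, _ = kmeans(pixels, k)   # may return fewer than k rows
codes, _ = vq(pixels, codebook)
quantized_image = codebook[codes].reshape(image.shape)

# Count distinct colors before and after quantization
colors_before = len(np.unique(pixels, axis=0))
colors_after = len(np.unique(quantized_image.reshape(-1, 3), axis=0))

# Mean squared error introduced by the reduced palette
mse = np.mean((image - quantized_image) ** 2)

print(colors_before, "->", colors_after, "colors, MSE:", mse)
```

Raising k lowers the error at the cost of a larger palette, which is the trade-off to tune for a real image.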

Example 3: Custom Feature Quantization for Machine Learning

Quantization can also be applied in the realm of machine learning to simplify feature spaces, reduce model complexity, and potentially improve generalizability. This example demonstrates custom feature quantization for a dataset before employing a machine learning model.

from scipy.cluster.vq import vq, kmeans
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Sample dataset
X = np.random.rand(100, 4)  # 100 samples, 4 features
y = np.random.randint(0, 2, 100)  # Binary target

# Generate a codebook with 2**2 = 4 entries (2-bit quantization)
codebook, _ = kmeans(X, 2**2)

# Quantize features
X_quantized, dist = vq(X, codebook)

# Replace quantized indices with their corresponding centroid values
X_centroids = codebook[X_quantized]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_centroids, y, test_size=0.2)

# Use a simple logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

print("Model accuracy with quantized features:", score)

Output (varies between runs):

Model accuracy with quantized features: 0.55

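This approach simplifies the feature space by collapsing every sample onto one of at most four centroid vectors, and it lets you measure how that simplification affects model performance. Because the target in this example is random, the accuracy hovers around chance level (0.5); on real data, the useful comparison is against the same model trained on the unquantized features.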
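One detail worth noting: the example above learns the codebook from all 100 rows before splitting. If you would rather keep the test rows unseen, a possible variant is to fit the codebook on the training rows only and then quantize both splits against it. The sketch below assumes a simple positional split in place of train_test_split, so that it needs only NumPy and SciPy.

```python
import numpy as np
from scipy.cluster.vq import vq, kmeans

rng = np.random.default_rng(42)
X = rng.random((100, 4))  # 100 samples, 4 features

# Positional split: first 80 rows for training, last 20 for testing
X_train, X_test = X[:80], X[80:]

# Learn the codebook from the training rows only
codebook, _ = kmeans(X_train, 4)

# Quantize both splits against that same codebook
train_codes, _ = vq(X_train, codebook)
test_codes, test_dist = vq(X_test, codebook)

# Replace each row with its nearest centroid
X_train_q = codebook[train_codes]
X_test_q = codebook[test_codes]

print("Codebook size:", codebook.shape[0])
print("Mean test distance to nearest centroid:", test_dist.mean())
```

The mean test distance gives a quick sense of how well the training codebook covers unseen rows.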

Conclusion

The vq() function in SciPy is a compact tool for vector quantization: given a code book, it assigns each observation to its nearest entry and reports the distance. Combined with kmeans(), it covers tasks ranging from simple clustering to reducing image color spaces and preparing quantized feature sets for machine learning models. As the examples show, a basic understanding of the code book and the assignment step is enough to start applying vq() to your own data processing and analysis work.