NumPy: How to group by unique values in an array

Updated: January 23, 2024 By: Guest Contributor

Overview

NumPy is a fundamental library for scientific computing in Python. It provides tools for working with arrays that are not only efficient but also flexible. Among these tools is the ability to group by unique values in an array, a common operation in data processing and analysis. This tutorial will guide you through how to perform such groupings with multiple code examples.

Getting Started

First off, ensure that NumPy is installed in your Python environment. You can install it using pip:

pip install numpy

Once installed, import NumPy to get started:

import numpy as np

Basic Grouping by Unique Values

Let’s find the unique values in a NumPy array and then group the array by those values.

# Example array
array = np.array([1, 2, 3, 2, 1, 3, 4])

# Find the unique values
unique_values, counts = np.unique(array, return_counts=True)

# Display the unique values and their counts
print(unique_values)
print(counts)

The output will be:

[1 2 3 4]
[2 2 2 1]

We’ve obtained the unique values and their respective counts in the array.
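The two arrays returned by np.unique() line up position by position, so they can be zipped into a dictionary when a lookup by value is more convenient. A minimal sketch; the name value_counts is illustrative and not part of the original example:

```python
import numpy as np

array = np.array([1, 2, 3, 2, 1, 3, 4])
unique_values, counts = np.unique(array, return_counts=True)

# Pair each unique value with its count.
# .tolist() converts NumPy scalars to plain Python ints.
value_counts = dict(zip(unique_values.tolist(), counts.tolist()))
print(value_counts)  # {1: 2, 2: 2, 3: 2, 4: 1}
```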

Advanced Grouping with Indices

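With return_inverse=True, np.unique() also returns an array that maps every element of the original array to the position of its value in the array of unique values. These inverse indices can be used to reconstruct the original array (unique_values[inverse_indices]) or to build the groups. Let’s see how: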

# Continue using array from the previous example

# Get the indices of unique values
unique_values, inverse_indices = np.unique(array, return_inverse=True)

# Use the indices to show grouped data
groups = [[] for _ in unique_values]
for index, value in enumerate(array):
    # int() turns the NumPy scalar into a plain Python int so the lists print cleanly
    groups[inverse_indices[index]].append(int(value))

# Display the grouped data
for val, group in zip(unique_values, groups):
    print('Value:', val, 'Group:', group)

The output will look like this:

Value: 1 Group: [1, 1]
Value: 2 Group: [2, 2]
Value: 3 Group: [3, 3]
Value: 4 Group: [4]

We’ve successfully grouped array elements based on unique values.
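The Python loop above is easy to follow but slow on large arrays. One common vectorized alternative, sketched here rather than taken from the original example, is to sort the array and split it at the group boundaries. The names order, first_positions, and index_groups are illustrative:

```python
import numpy as np

array = np.array([1, 2, 3, 2, 1, 3, 4])

# Stable sort so equal values keep their original relative order
order = np.argsort(array, kind="stable")
sorted_array = array[order]

# return_index gives the position where each unique value first appears
# in the sorted array, i.e. the start of each group
unique_values, first_positions = np.unique(sorted_array, return_index=True)

# Split at every group boundary (position 0 is skipped to avoid an empty first chunk)
groups = np.split(sorted_array, first_positions[1:])
index_groups = np.split(order, first_positions[1:])

for val, group, idx in zip(unique_values, groups, index_groups):
    print('Value:', val, 'Group:', group.tolist(), 'Original indices:', idx.tolist())
```

Because index_groups holds the original positions of each group's members, it can also be used to pull matching rows out of a second array of the same length.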

Group by Unique with Multidimensional Arrays

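Grouping becomes trickier with multidimensional arrays. By default, np.unique() treats its input as if it were flattened, so we either flatten the array ourselves or use a structured array when the grouping should respect the array’s structure.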

Using Flattening

We can flatten the array and apply the same concepts as before:

# Example multidimensional array
multi_array = np.array([[1, 2], [2, 3], [1, 3]])

# Flatten the array
flat_array = multi_array.flatten()

# Apply the unique function
unique_values, counts = np.unique(flat_array, return_counts=True)

# Display the unique values and their counts
print(unique_values)
print(counts)

Output:

[1 2 3]
[2 2 2]
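Flattening discards where each value sat in the original grid. If that matters, the inverse indices can be reshaped back to the original layout. The following is a rough sketch; labels and coords_of_2 are illustrative names, and the explicit ravel() is an assumption made to keep the inverse array one-dimensional regardless of NumPy version:

```python
import numpy as np

multi_array = np.array([[1, 2], [2, 3], [1, 3]])

# Work on an explicit 1D view so the inverse indices are 1D
unique_values, inverse = np.unique(multi_array.ravel(), return_inverse=True)

# Reshape the group labels back to the original 2D layout
labels = inverse.reshape(multi_array.shape)
print(labels.tolist())  # [[0, 1], [1, 2], [0, 2]]

# Row/column coordinates of every cell that holds the value 2
coords_of_2 = np.argwhere(multi_array == 2)
print(coords_of_2.tolist())  # [[0, 1], [1, 0]]
```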

Using Structured Arrays

For multidimensional arrays where grouping should maintain the structure, consider a structured approach:

# Define a structured array with named fields
structured_array = np.array([(1, 2), (2, 3), (1, 3)], dtype=[('f0', 'i4'), ('f1', 'i4')])

# Apply the unique function to the first column (field 'f0')
unique_first_column = np.unique(structured_array['f0'])

# Display the unique values of the first column
print('Unique values in the first column:', unique_first_column)

Output:

Unique values in the first column: [1 2]
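If the goal is to group whole rows rather than a single column, np.unique() also accepts an axis argument. A short sketch using a hypothetical array, rows, with one duplicated row added so that the counts are not all 1:

```python
import numpy as np

# Hypothetical data: the row [1, 2] appears twice
rows = np.array([[1, 2], [2, 3], [1, 2], [1, 3]])

# axis=0 compares entire rows instead of individual elements
unique_rows, row_counts = np.unique(rows, axis=0, return_counts=True)

print(unique_rows.tolist())  # [[1, 2], [1, 3], [2, 3]]
print(row_counts.tolist())   # [2, 1, 1]
```

The unique rows come back sorted lexicographically, not in order of first appearance.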

Aggregation on Grouped Data

Once grouped, we might want to perform some aggregation. Here’s how we can sum the elements that belong to each unique value:

# Continue using array from the first example
unique_values = np.unique(array)

# Sum the elements per value: weights=array adds each element into the bin of its own value
summed_counts = np.bincount(array, weights=array).astype(int)

# Keep only the bins of values that actually occur (bin 0 is empty here)
summed_counts = summed_counts[unique_values]

# Display the sums
for val, count_sum in zip(unique_values, summed_counts):
    print('Value:', val, 'Sum:', count_sum)

Sample output may look like this:

Value: 1 Sum: 2
Value: 2 Sum: 4
Value: 3 Sum: 6
Value: 4 Sum: 4

In this more advanced example, we used np.bincount(), an efficient way to aggregate per unique value in a single vectorized pass. Note that it only accepts arrays of non-negative integers.
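When the keys are not small non-negative integers (strings, floats, or negative numbers, for example), the inverse indices from np.unique() can stand in for them as bin numbers. The sketch below uses hypothetical keys and values arrays to compute per-group sums, counts, and means:

```python
import numpy as np

# Hypothetical data: one key and one measurement per observation
keys = np.array(["a", "b", "a", "c", "b", "a"])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

unique_keys, inverse = np.unique(keys, return_inverse=True)

# inverse holds small non-negative integers, so it is valid bincount input
# even though the keys themselves are strings
sums = np.bincount(inverse, weights=values)
counts = np.bincount(inverse)
means = sums / counts

for key, total, mean in zip(unique_keys.tolist(), sums.tolist(), means.tolist()):
    print('Key:', key, 'Sum:', total, 'Mean:', mean)
```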

Conclusion

We’ve walked through several techniques for grouping by unique values in NumPy arrays, for both 1D and multidimensional cases. NumPy provides flexible and efficient ways to work with large datasets, simplifying tasks that would otherwise be cumbersome. Applying these methods to data analysis tasks can greatly streamline your workflow.