In data science and numerical computing, it’s common to work with large datasets represented as two-dimensional arrays. Often, we need to find the unique rows or columns in such data to simplify analysis, remove duplicates, or understand the variety within the dataset. This tutorial explores practical methods to find unique rows and columns using NumPy, a powerful numerical processing library in Python.

Prerequisites: Before jumping into the code, make sure you have Python and NumPy installed. Python can be downloaded from python.org and NumPy can be installed using pip, which is Python’s package installer. Simply type the following in your command line:

pip install numpy

Understanding NumPy Arrays

NumPy arrays, or ndarrays, resemble Python’s built-in lists, but they are faster and better suited to mathematical computation: the core is implemented in C, and every element of an array shares a single data type (its dtype). Before we tackle the problem of finding unique rows or columns, it’s essential to understand how to create and manipulate these arrays.

import numpy as np

# Creating a 2D array
a = np.array([[1, 2], [3, 4], [1, 2]])
print(a)
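As a small illustration of the array basics used in the rest of this tutorial, the sketch below inspects the shape and dtype of the same array and pulls out one row and one column. The variable names first_row and first_col are chosen here for illustration only.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [1, 2]])

# shape is (number of rows, number of columns)
print(a.shape)
# All elements share one dtype; the exact integer type depends on the platform
print(a.dtype)

# Axis 0 runs over rows, axis 1 runs over columns
first_row = a[0]
first_col = a[:, 0]
print(first_row)
print(first_col)
```

Keeping in mind that axis 0 indexes rows and axis 1 indexes columns makes the axis argument of np.unique() easier to read.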

Finding Unique Rows

To find the unique rows in a 2D NumPy array, we can use the np.unique() function with the axis parameter. The following code illustrates this:

import numpy as np

# Sample 2D array
arr = np.array([[1, 2], [3, 4], [1, 2], [2, 3]])

# Finding unique rows
unique_rows = np.unique(arr, axis=0)
print(unique_rows)

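When the axis parameter is set to 0, np.unique() treats each row as a single item and returns the distinct rows. The result is sorted lexicographically, so the rows may not appear in their original order: here the output contains [1 2], [2 3], and [3 4], in that order.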
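If the order in which rows first appear matters, one possible approach is to ask for the index of each unique row's first occurrence with return_index and sort those indices. This is a minimal sketch; the names first_idx, sorted_unique, and ordered_unique are illustrative.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4], [1, 2], [2, 3]])

# Default behaviour: unique rows come back in sorted order
sorted_unique = np.unique(arr, axis=0)

# return_index gives the position of the first occurrence of each unique row
_, first_idx = np.unique(arr, axis=0, return_index=True)

# Sorting those positions restores the order of first appearance
ordered_unique = arr[np.sort(first_idx)]

print(sorted_unique)
print(ordered_unique)
```

The first result lists the rows as [1 2], [2 3], [3 4], while the second keeps the original sequence [1 2], [3 4], [2 3].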

Including the Inverse and Counts

NumPy’s np.unique() function can also provide the inverse mapping and the counts of the unique rows. This is done by setting the return_inverse and return_counts parameters to True.

import numpy as np

# Sample 2D array
arr = np.array([[1, 2], [3, 4], [1, 2], [2, 3]])

# Finding unique rows with inverse and counts
unique_rows, indices, counts = np.unique(arr, axis=0, return_inverse=True, return_counts=True)

print('Unique Rows:\n', unique_rows)
print('Indices: ', indices)
print('Counts: ', counts)

The indices array is the inverse mapping: for each row of the original array, it holds the position of that row within unique_rows, so indexing unique_rows with it rebuilds the original array. The counts array indicates how many times each unique row appeared in the original array.
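To make the roles of the inverse mapping and the counts concrete, the following sketch rebuilds the original array and selects the rows that occur more than once. The names inverse, rebuilt, and duplicated are illustrative, and the reshape is a defensive step on the assumption that some NumPy releases may return the inverse with more than one dimension when axis is given.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4], [1, 2], [2, 3]])

unique_rows, inverse, counts = np.unique(
    arr, axis=0, return_inverse=True, return_counts=True
)

# Flatten the inverse so it can be used as a plain 1-D index array
inverse = inverse.reshape(-1)

# Indexing the unique rows with the inverse mapping rebuilds the input
rebuilt = unique_rows[inverse]

# A boolean mask on counts keeps only the rows that were repeated
duplicated = unique_rows[counts > 1]

print(np.array_equal(rebuilt, arr))
print(duplicated)
```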

Finding Unique Columns

Finding unique columns in a 2D array works the same way as finding unique rows, except that you set the axis parameter to 1.

import numpy as np

# Sample 2D array
arr = np.array([[1, 1, 2], [2, 1, 2], [3, 1, 3]])

# Finding unique columns
unique_columns = np.unique(arr, axis=1)
print(unique_columns)

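With axis set to 1, np.unique() compares whole columns rather than individual elements and returns the distinct columns in sorted order. In this sample all three columns are already distinct, so the output contains the same three columns, reordered.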
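A sample with a repeated column makes the deduplication visible. The sketch below uses a different, illustrative array whose first and third columns are identical, and also checks that working on the transpose with axis=0 gives the same answer.

```python
import numpy as np

# Columns 0 and 2 are both [1, 2, 3]
arr = np.array([[1, 2, 1], [2, 1, 2], [3, 1, 3]])

# Unique columns directly
unique_cols = np.unique(arr, axis=1)

# Same result by finding unique rows of the transpose and transposing back
via_transpose = np.unique(arr.T, axis=0).T

print(unique_cols)
print(np.array_equal(unique_cols, via_transpose))
```

Only two columns remain, [1, 2, 3] and [2, 1, 1], and both routes agree.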

Working with Structured Arrays

Real-world data often consists of records with several named fields of different types, where each row should be treated as a whole. NumPy’s structured arrays model this directly.

import numpy as np

# Creating a structured array
struct_arr = np.array([(1, 'Apple'), (2, 'Orange'), (1, 'Apple')],
                      dtype=[('id', 'i4'), ('name', 'U10')])

# Finding unique records
unique_struct_arr = np.unique(struct_arr)
print(unique_struct_arr)

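Because each element of a structured array is a complete record, np.unique() needs no axis argument here: two records are duplicates only when all of their fields match. The result contains the two distinct records, (1, 'Apple') and (2, 'Orange').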
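Sometimes uniqueness should be judged on a single field rather than the whole record. One way to sketch this is to run np.unique() on that field with return_index and use the indices to pick records. The extra 'Pear' record and the name first_per_id are added here purely for illustration.

```python
import numpy as np

struct_arr = np.array(
    [(1, 'Apple'), (2, 'Orange'), (1, 'Apple'), (2, 'Pear')],
    dtype=[('id', 'i4'), ('name', 'U10')],
)

# Whole-record uniqueness: (2, 'Orange') and (2, 'Pear') are different records
unique_records = np.unique(struct_arr)

# Field-level uniqueness: keep the first record seen for each id
_, first_idx = np.unique(struct_arr['id'], return_index=True)
first_per_id = struct_arr[first_idx]

print(unique_records)
print(first_per_id)
```

Three records survive whole-record deduplication, while only the first record for each id survives the field-level version.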

Advanced: Using a Custom Function

For even more control, you can reinterpret each row as a single structured element and let np.unique() compare those elements directly.

import numpy as np

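# Defining a custom function for finding unique rows
def unique_rows_custom(arr):
    # view() can only reinterpret rows that are contiguous in memory
    arr = np.ascontiguousarray(arr)
    # One unnamed field per column, so a whole row compares as a single element
    arr_view = arr.view([('', arr.dtype)] * arr.shape[1])
    # return_index gives the position of each unique row's first occurrence
    _, unique_indices = np.unique(arr_view, return_index=True)
    return arr[unique_indices]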

# Sample 2D array
arr = np.array([[1, 1, 1], [2, 2, 2], [1, 1, 1]])

# Finding unique rows
unique_rows = unique_rows_custom(arr)
print(unique_rows)

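This approach uses view() to reinterpret each row as one structured element, so that np.unique() compares whole rows. It was a common idiom before the axis parameter was added in NumPy 1.13, and it remains useful on older versions or when the built-in behaviors of np.unique() don’t suffice. Like np.unique() itself, it returns the rows in sorted order.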
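One caveat with the view-based technique is that view() can only change the element size when the rows are contiguous in memory, so a sliced array may raise an error. The sketch below is a variant under that assumption: unique_rows_view is a hypothetical helper name, and it copies the input into contiguous memory first, then compares its result with np.unique(..., axis=0).

```python
import numpy as np

def unique_rows_view(arr):
    """Hypothetical helper: unique rows via a structured view of each row."""
    # Make sure each row occupies one contiguous block of memory
    arr = np.ascontiguousarray(arr)
    # One unnamed field per column; NumPy assigns default field names
    row_dtype = np.dtype([('', arr.dtype)] * arr.shape[1])
    # Each row becomes a single structured element
    records = arr.view(row_dtype).reshape(-1)
    _, first_idx = np.unique(records, return_index=True)
    return arr[first_idx]

data = np.array([[1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 5, 5]])

# Taking every other column yields a non-contiguous array
strided = data[:, ::2]

result = unique_rows_view(strided)
expected = np.unique(strided, axis=0)

print(result)
print(np.array_equal(result, expected))
```

For integer data such as this, both routes return the same rows in the same sorted order.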

Conclusion

Learning how to find unique rows and columns in NumPy arrays is an essential skill for data manipulation and pre-processing. With the methods outlined above, you can remove duplicate rows, columns, and records, count how often each one occurs, and map the results back to the original data. NumPy’s flexibility gives you several ways to analyze and process your data efficiently.
