NumPy: How to find common values between two arrays

Updated: January 23, 2024 By: Guest Contributor

Introduction

Working with numerical data in Python often involves NumPy, a library that provides high-performance multidimensional array objects and tools for working with them. A common task is finding the intersection of two arrays, that is, the set of elements common to both. In this guide, we’ll explore how to do this with NumPy, moving from basic to more advanced techniques, with code examples along the way.

Getting Started

To get started, you need to have NumPy installed. If it’s not already on your system, you can install it using pip:

pip install numpy

Once installed, you can import NumPy and start using it:

import numpy as np

Finding Common Values: Basics

Let’s start with the simplest scenario – you have two 1D arrays and you want to find the common elements. NumPy provides the np.intersect1d function that returns the sorted, unique values that are in both of the input arrays:

import numpy as np 

a = np.array([1, 2, 3, 4, 5]) 
b = np.array([4, 5, 6, 7, 8]) 

print(np.intersect1d(a, b))

The output of this code will be:

[4 5]

Note that the result of np.intersect1d is always sorted and contains each common value only once. Even if the original arrays contain duplicates, you won’t find duplicates in the result.
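To make the duplicate handling concrete, here is a small sketch with repeated values in both inputs. The arrays are made up for illustration. The optional return_indices=True argument of np.intersect1d additionally reports where the first occurrence of each common value sits in each input.

```python
import numpy as np

# Illustrative inputs with repeated values (not taken from the example above)
a = np.array([1, 2, 2, 3, 4])
b = np.array([2, 2, 4, 4, 6])

# Duplicates collapse: each common value appears once, in sorted order
common = np.intersect1d(a, b)
print(common)  # [2 4]

# return_indices=True also returns the index of the first occurrence
# of each common value in a and in b
common, idx_a, idx_b = np.intersect1d(a, b, return_indices=True)
print(idx_a)  # [1 4]
print(idx_b)  # [0 2]
```

The index arrays are handy when the common values are keys and you need to pull the matching rows out of related arrays.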

Working with Multidimensional Arrays

Now, suppose you’re working with multidimensional arrays. The np.intersect1d function can still be used since it flattens the input arrays before computing the intersection. Here’s an example:

a = np.array([[1, 2, 3], [4, 5, 6]]) 
b = np.array([[4, 5, 9], [7, 8, 6]]) 

print(np.intersect1d(a, b))

Output:

[4 5 6]

Note that the result remains a 1D array regardless of the dimensionality of the inputs.
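If you also need to know where the common values sit in the original 2D array, one possible approach is np.isin, which returns a boolean mask with the same shape as its first argument. The sketch below reuses the arrays from this section; the variable names mask and positions are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[4, 5, 9], [7, 8, 6]])

# The intersection itself is always one-dimensional
common = np.intersect1d(a, b)
print(common.ndim)  # 1

# np.isin keeps the shape of its first argument
mask = np.isin(a, b)
print(mask.shape)  # (2, 3)

# (row, column) positions in a whose values also occur somewhere in b
positions = np.argwhere(mask)
print(positions.tolist())  # [[1, 0], [1, 1], [1, 2]]
```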

Preserving Order

If maintaining the original order is important, you need a workaround, since np.intersect1d returns sorted values. One approach is to compute the intersection first and then filter the first input array against it, which preserves the order of that array:

import numpy as np

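def ordered_intersect(a, b):
    # Sorted, unique values present in both arrays
    common = np.intersect1d(a, b)
    # Walk a in its original order and keep the values found in common
    return np.array([x for x in a if x in common])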

a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 3, 2])

print(ordered_intersect(a, b))

The output respects the order of array a:

[2 3 4 5]
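The same idea can be sketched without a Python-level loop by using np.isin as a boolean mask. The helper name ordered_intersect_vectorized and its unique flag are hypothetical, introduced here only for illustration, and the sketch assumes 1D inputs. By default, values repeated in a are kept; with unique=True only the first occurrence of each value survives.

```python
import numpy as np

def ordered_intersect_vectorized(a, b, unique=False):
    """Hypothetical helper: values of a that occur in b, in a's order."""
    a = np.asarray(a)
    # The boolean mask keeps a's order (and any repeated values)
    kept = a[np.isin(a, b)]
    if unique:
        # Keep only the first occurrence of each value, still in a's order
        _, first_idx = np.unique(kept, return_index=True)
        kept = kept[np.sort(first_idx)]
    return kept

a = np.array([5, 3, 5, 1, 3, 2])
b = np.array([3, 5, 7])

print(ordered_intersect_vectorized(a, b))               # [5 3 5 3]
print(ordered_intersect_vectorized(a, b, unique=True))  # [5 3]
```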

Finding Common Values with Conditions

Sometimes, you might want only the common elements that satisfy a certain condition. This can be done by applying boolean indexing to the result of np.intersect1d. For example, to get common elements that are even:

a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 7, 8])

common = np.intersect1d(a, b)
even_common = common[common % 2 == 0]

print(even_common)

Output:

[4]
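As an alternative sketch, the membership test and the condition can be combined into a single boolean mask with np.isin. Unlike np.intersect1d, this keeps the order of a and any repeated values in it. The variable name mask is illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 7, 8])

# True where the value of a occurs in b AND is even
mask = np.isin(a, b) & (a % 2 == 0)

print(a[mask])  # [4]
```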

Advanced Techniques

Custom Comparison Functions

If you need matching criteria that np.intersect1d cannot express, you will have to implement custom logic. For example, consider finding elements of one array that lie within a certain numerical tolerance of some element of the other:

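def within_tolerance(a, b, tol=0.1):
    # Collect each x from a that has at least one y in b closer than tol
    result = []
    for x in a:
        for y in b:
            if abs(x - y) < tol:
                result.append(x)
                break  # one match is enough; move on to the next x
    return np.array(result)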

a = np.array([1.01, 2.02, 3.03])
b = np.array([1.00, 2.05])

print(within_tolerance(a, b))

Output:

[1.01 2.02]

Note that the nested Python loops perform up to len(a) × len(b) comparisons, so this function is not suitable for large arrays. However, it provides flexibility that np.intersect1d doesn’t offer.
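For moderately sized arrays, the same tolerance check can be sketched with broadcasting instead of explicit loops. The function name within_tolerance_vectorized is hypothetical. Note that this builds a len(a) × len(b) matrix of differences, so it trades memory for speed and is still quadratic in the number of comparisons.

```python
import numpy as np

def within_tolerance_vectorized(a, b, tol=0.1):
    """Hypothetical helper: values of a within tol of some value in b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # diff[i, j] = |a[i] - b[j]|, built by broadcasting; shape (len(a), len(b))
    diff = np.abs(a[:, None] - b[None, :])
    # Keep a[i] if at least one b[j] is closer than tol
    return a[(diff < tol).any(axis=1)]

a = np.array([1.01, 2.02, 3.03])
b = np.array([1.00, 2.05])

print(within_tolerance_vectorized(a, b))  # [1.01 2.02]
```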

Hashing for Large Arrays

When dealing with large arrays, a hash-based approach avoids the quadratic complexity of nested loops. Python’s built-in set data structure can be used for this, at the cost of leaving NumPy’s vectorized routines and optimizations behind:

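def fast_intersect(a, b):
    # Convert to plain Python scalars first; hashing them is typically
    # cheaper than hashing NumPy scalar objects one by one
    a_set = set(a.tolist())
    b_set = set(b.tolist())
    return np.array(list(a_set & b_set))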

a = np.random.randint(0, 100000, size=10000)
b = np.random.randint(0, 100000, size=10000)

print(fast_intersect(a, b))

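This function avoids the quadratic cost of nested loops, but it does not guarantee that the results follow the order of either original array, nor that they are sorted. Whether it outperforms np.intersect1d depends on the data and the NumPy version, so it is worth timing both on your own arrays.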
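A quick way to check both correctness and speed is sketched below. The name set_intersect is a stand-in for the set-based function above, and the seed, array sizes, and repeat count are arbitrary assumptions. No timings are quoted here because they vary by machine and NumPy version.

```python
import timeit
import numpy as np

def set_intersect(a, b):
    # Stand-in for the set-based approach described above
    return np.array(list(set(a.tolist()) & set(b.tolist())))

# Seeded generator so the comparison is reproducible
rng = np.random.default_rng(0)
a = rng.integers(0, 100_000, size=10_000)
b = rng.integers(0, 100_000, size=10_000)

via_sets = set_intersect(a, b)
via_numpy = np.intersect1d(a, b)

# Once sorted, the set-based result matches np.intersect1d
print(np.array_equal(np.sort(via_sets), via_numpy))  # True

# Time both approaches; results depend on the environment
t_sets = timeit.timeit(lambda: set_intersect(a, b), number=20)
t_numpy = timeit.timeit(lambda: np.intersect1d(a, b), number=20)
print(f"sets: {t_sets:.4f}s  intersect1d: {t_numpy:.4f}s")
```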

Conclusion

In this tutorial, we’ve learned how to find common values between two NumPy arrays using a variety of techniques. We’ve seen that np.intersect1d is suitable for most cases but also explored ways to preserve array order and how to handle custom comparison logic. Whether working with small or large data sets, NumPy offers tools and methods to facilitate the process of identifying overlapping data.