NumPy: Random selection with custom probabilities

Introduction
NumPy’s Random Module
Basic Random Selection
Specifying Custom Probabilities
    1. Example: Sampling Without Replacement
Advanced Examples
    1. Weighted Sampling for Machine Learning
    2. Shuffling with Custom Probabilities
Conclusion

Introduction

NumPy is a powerful library for numerical computing in Python. Among other things, it can draw random samples from an array while assigning a custom probability to each element. This is useful in simulation, stochastic processes, resampling for machine learning, and other applications that need non-uniform randomness. This tutorial shows how to perform random selection with custom probabilities in NumPy, with code examples and sample outputs that range from basic to advanced.

NumPy’s Random Module

Before diving into the examples, it helps to understand the numpy.random module, which provides functions for generating random numbers and performing random operations. Random sampling with custom probabilities is handled by its choice function.

The general syntax of the choice function is as follows:

numpy.random.choice(a, size=None, replace=True, p=None)

Where:

a: 1-D array-like or int. If an array, a random sample is generated from its elements; if an int, the random sample is generated from np.arange(a).
size: The shape of the output array. If None (the default), a single value is returned.
replace: Whether the sample is with (True) or without replacement (False).
p: The probabilities associated with each entry in a. It must have the same length as a and sum to 1. If not provided, the sample assumes a uniform distribution over all entries in a.
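
As a quick illustration of these parameters, here is a minimal sketch (our own example, with arbitrary values) that passes an int for a, a tuple for size, and a p array that deliberately fails to sum to 1. We expect NumPy to reject the last call with a ValueError.

```python
import numpy as np

# An int for `a` samples from np.arange(a); a tuple for `size` sets the output shape
grid = np.random.choice(5, size=(2, 3))
print(grid.shape)  # (2, 3)

# With size=None (the default) a single value is returned instead of an array
single = np.random.choice(5)

# `p` must sum to 1; these values sum to 0.9, so the call is rejected
try:
    np.random.choice(3, p=[0.5, 0.3, 0.1])
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```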

Basic Random Selection

The simplest use case is generating a random sample with a uniform distribution. Let’s see a basic example without custom probabilities:

import numpy as np
arr = np.array([10, 20, 30, 40])
random_sample = np.random.choice(arr, size=10)
print(random_sample)

Output (example):

[30 20 10 40 10 20 30 40 40 20]

In the example above, we drew 10 items at random from the array, with replacement (the default), so values can repeat. Since we didn’t specify the p parameter, NumPy assumed a uniform distribution.
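
Because the output changes on every run, it can help to fix a seed while experimenting. The sketch below uses NumPy’s newer Generator interface (np.random.default_rng), whose choice method accepts the same a, size, replace, and p arguments. The seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

arr = np.array([10, 20, 30, 40])

# Two generators created with the same seed produce the same sequence of draws
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.choice(arr, size=10)
sample_b = rng_b.choice(arr, size=10)

print(np.array_equal(sample_a, sample_b))  # True
```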

Specifying Custom Probabilities

Now, let’s dive into how to specify custom probabilities for each element. Consider the following example:

import numpy as np
arr = np.array(['A', 'B', 'C', 'D'])
custom_probs = np.array([0.1, 0.2, 0.3, 0.4])
random_sample = np.random.choice(arr, size=10, p=custom_probs)
print(random_sample)

Output (example):

['D' 'C' 'B' 'D' 'C' 'D' 'B' 'C' 'D' 'A']

The output demonstrates the use of the p parameter, where elements ‘A’, ‘B’, ‘C’, and ‘D’ have probabilities of 0.1, 0.2, 0.3, and 0.4, respectively. With only 10 draws the observed counts will vary from run to run, but over many draws the relative frequencies approach these probabilities.
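
To check that the draws really follow the requested probabilities, one option is to draw a large sample and compare the observed frequencies with p. This is a minimal sketch; the sample size of 100,000 and the seed are arbitrary choices, and the printed frequencies should land close to 0.1, 0.2, 0.3, and 0.4.

```python
import numpy as np

arr = np.array(['A', 'B', 'C', 'D'])
custom_probs = np.array([0.1, 0.2, 0.3, 0.4])

np.random.seed(0)  # arbitrary seed so the run is repeatable
n_draws = 100_000
sample = np.random.choice(arr, size=n_draws, p=custom_probs)

# Count how often each label was drawn and convert counts to relative frequencies
labels, counts = np.unique(sample, return_counts=True)
frequencies = counts / n_draws

for label, freq in zip(labels, frequencies):
    print(label, round(float(freq), 3))
```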

Example: Sampling Without Replacement

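What if you want to do a random selection without replacement? You can set the replace parameter to False, which ensures each element is picked at most once. In that case, size cannot exceed the number of elements that have a non-zero probability. The example below reuses arr and custom_probs from the previous example: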

unique_sample = np.random.choice(arr, size=4, replace=False, p=custom_probs)
print(unique_sample)

Output (example):

['C' 'B' 'D' 'A']

Since size equals the number of elements in the array and replace is False, every element appears exactly once. The order is still random, but it is biased by the probabilities: elements with higher probability tend to appear earlier.
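
The constraints of sampling without replacement can be sketched as follows. The example (ours, with arbitrary values) shows that a smaller draw contains no duplicates, and that NumPy raises a ValueError when size exceeds the population or the number of entries with non-zero probability.

```python
import numpy as np

arr = np.array(['A', 'B', 'C', 'D'])
custom_probs = np.array([0.1, 0.2, 0.3, 0.4])

# Drawing fewer elements than the array holds still yields no duplicates
pair = np.random.choice(arr, size=2, replace=False, p=custom_probs)
print(len(set(pair.tolist())) == 2)  # True

# Asking for more elements than exist cannot work without replacement
try:
    np.random.choice(arr, size=5, replace=False, p=custom_probs)
    too_large_rejected = False
except ValueError:
    too_large_rejected = True
print(too_large_rejected)  # True

# Only two entries have non-zero probability, so three unique draws are impossible
try:
    np.random.choice(arr, size=3, replace=False, p=[0.5, 0.5, 0.0, 0.0])
    zero_prob_rejected = False
except ValueError:
    zero_prob_rejected = True
print(zero_prob_rejected)  # True
```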

Advanced Examples

Let’s discuss some advanced use cases to better understand random selection with custom probabilities.

Weighted Sampling for Machine Learning

When dealing with imbalanced datasets in machine learning, we sometimes need to sample data points with custom probabilities to balance the classes. This might involve oversampling the minority class or undersampling the majority class. Here’s how you can perform weighted sampling on a small label array, giving each sample of class 0 three times the weight of a sample of class 1:

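import numpy as np

y = np.array([1, 0, 0, 1, 0, 1])
# Weight each sample by its class: 0.75 for class 0, 0.25 for class 1
weights = np.where(y == 0, 0.75, 0.25)
# Normalize the weights so they sum to 1, then draw indices into y
indices = np.random.choice(np.arange(len(y)), size=10, p=weights / weights.sum())
print(indices)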

Output (example):

[1 4 3 2 5 1 2 0 5 2]

In this example, each index whose y value is 0 is three times as likely to be selected as an index whose y value is 1. Because both classes contain three samples, about 75% of the sampled indices should point to class 0.
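
To actually balance an imbalanced label array, a common approach is to weight each sample by the inverse of its class frequency. The sketch below assumes a hypothetical dataset with 90 samples of class 0 and 10 of class 1; the variable names, the sample size, and the seed are illustrative choices, not part of any NumPy API. After resampling, the minority share should be close to 0.5.

```python
import numpy as np

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)

# Weight every sample by the inverse of its class frequency
class_counts = np.bincount(y)      # counts per class: [90, 10]
weights = 1.0 / class_counts[y]
probs = weights / weights.sum()    # normalize so the probabilities sum to 1

np.random.seed(0)  # arbitrary seed so the run is repeatable
indices = np.random.choice(len(y), size=10_000, p=probs)

# Share of minority-class samples before and after resampling
print(y.mean())  # 0.1
minority_share = y[indices].mean()
print(round(float(minority_share), 2))
```

The resulting indices can then be used to select rows from the feature matrix as well, for example X[indices] for a hypothetical array X aligned with y.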

Shuffling with Custom Probabilities

Shuffling can be thought of as sampling without replacement. To shuffle an array using custom probabilities, you can call random.choice repeatedly in a loop: select a random index based on the current probabilities, remove that element, re-normalize the remaining probabilities, and repeat until no elements are left.

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
probabilities = np.array([0.1, 0.1, 0.3, 0.3, 0.2])
shuffled_arr = []

# Build a shuffled list whose order is biased by the given probabilities
for _ in range(len(arr)):
    # Pick one of the remaining positions according to the current probabilities
    index = np.random.choice(len(arr), p=probabilities)
    shuffled_arr.append(int(arr[index]))
    # Remove the selected element and its probability
    arr = np.delete(arr, index)
    probabilities = np.delete(probabilities, index)
    # Re-normalize the remaining probabilities so they sum to 1 again
    if probabilities.size > 0:
        probabilities = probabilities / probabilities.sum()

print(shuffled_arr)

Note that this is a simplified conceptual example; using a loop for shuffling is inefficient for large arrays.
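
For larger arrays, a single call with replace=False and size equal to the array length avoids the explicit loop. This is a sketch rather than a drop-in replacement: we expect it to produce the same kind of weighted ordering, with higher-probability elements tending to appear earlier, although the individual draws will differ from those of the loop.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
probabilities = np.array([0.1, 0.1, 0.3, 0.3, 0.2])

# One call draws every element exactly once; NumPy handles the
# bookkeeping that the explicit loop performs by hand
shuffled = np.random.choice(arr, size=len(arr), replace=False, p=probabilities)
print(shuffled)
```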

Conclusion

In this tutorial, we explored how to perform random selection with custom probabilities using NumPy. We went through several practical examples starting from the basics to more complex and advanced scenarios. By understanding these concepts, you’ll be well-equipped to handle various tasks that require sophisticated methods of random sampling in your data science and machine learning projects.
