NumPy: Generate variates from a multivariate hypergeometric distribution (3 examples)

Updated: March 2, 2024 By: Guest Contributor

In this tutorial, we look at how to generate variates from a multivariate hypergeometric distribution using NumPy, a foundational package for numerical computing in Python. This distribution applies when a batch or lot contains several types or categories of items and we’re interested in how many items of each type appear when drawing without replacement.

NumPy provides the method numpy.random.Generator.multivariate_hypergeometric (added in NumPy 1.18) for this purpose. We will explore it through a series of examples, each increasing in complexity, to show how the distribution behaves and where it can be applied.

Understanding the Multivariate Hypergeometric Distribution

At its core, the multivariate hypergeometric distribution extends the concept of the classic hypergeometric distribution to more than two categories. Imagine you have a bag with red, blue, and green marbles. If you were to draw a handful of marbles without looking, the probability of picking a specific number of each color would be described by this distribution.
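To make the marble example concrete, the probability of one specific outcome can be computed by counting combinations: the number of ways to pick the requested count of each color, divided by the number of ways to pick the whole handful. The sketch below uses only the standard library; the helper name multivariate_hypergeom_pmf is hypothetical and is not part of NumPy.

```python
from math import comb


def multivariate_hypergeom_pmf(colors, counts):
    """Probability of drawing exactly counts[i] items of type i, without
    replacement, from a population holding colors[i] items of type i."""
    if len(colors) != len(counts):
        raise ValueError("colors and counts must have the same length")
    ways = 1
    for available, drawn in zip(colors, counts):
        ways *= comb(available, drawn)  # comb() is 0 when drawn > available
    return ways / comb(sum(colors), sum(counts))


# Probability of 4 red, 3 blue, and 3 green marbles
# when drawing 10 from a bag of 20 red, 15 blue, and 30 green
p = multivariate_hypergeom_pmf([20, 15, 30], [4, 3, 3])
print(p)
```

Summing this function over every possible outcome of a draw gives 1, which is a quick way to sanity-check it on a small population.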

The key parameters include:

  • colors: An array representing the total number of items (marbles, in our example) of each color/type in the population.
  • nsample: The total number of items to draw from the population; it cannot exceed the sum of colors.
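Besides colors and nsample, the method also accepts the optional arguments size (how many variates to generate) and method ("marginals", the default, or "count"). The following sketch shows how size determines the shape of the returned array; the seed and the specific sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for this illustration
colors = [20, 15, 30]

# size=None (the default) returns one variate as a 1-D array
single = rng.multivariate_hypergeometric(colors, nsample=10)
# An integer size returns that many variates, one per row
batch = rng.multivariate_hypergeometric(colors, nsample=10, size=4)
# A tuple size returns an array of shape size + (len(colors),)
grid = rng.multivariate_hypergeometric(colors, nsample=10, size=(2, 3))
# method="count" selects the alternative sampling algorithm
counted = rng.multivariate_hypergeometric(colors, nsample=10, method="count")

print(single.shape)   # (3,)
print(batch.shape)    # (4, 3)
print(grid.shape)     # (2, 3, 3)
print(counted.shape)  # (3,)
```

In every case the last axis has one entry per color, and the entries along that axis sum to nsample.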

Let’s now dive into the examples.

Example 1: Basic Usage

In this first example, we generate a single random variate from a population consisting of multiple categories.

import numpy as np

rng = np.random.default_rng()  # Create a random number generator instance
colors = [20, 15, 30]  # Total number of red, blue, and green marbles
samples = rng.multivariate_hypergeometric(colors, nsample=10, size=1)
print(samples)

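Because size=1 is passed, the result is a 2-D array with a single row, so the output could look like [[4 3 3]], indicating that in our random draw of 10 marbles, we got 4 red, 3 blue, and 3 green marbles. Omitting size returns a 1-D array instead. Since no seed is set, the numbers will differ from run to run.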
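A single draw says little about the distribution itself. As a rough check, the average count per color over many draws should approach nsample * colors / sum(colors). The sketch below compares the two; the seed and the number of draws are arbitrary assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed for this illustration
colors = np.array([20, 15, 30])     # red, blue, and green marbles
nsample = 10

# Draw many variates at once; each row is one draw of 10 marbles
draws = rng.multivariate_hypergeometric(colors, nsample, size=20_000)

expected = nsample * colors / colors.sum()  # theoretical mean count per color
observed = draws.mean(axis=0)               # empirical mean count per color

print("expected:", expected)
print("observed:", observed)
```

Every row still sums to nsample, and no entry can exceed the number of marbles available for that color.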

Example 2: Reproducibility and Multiple Draws

Next, let’s look at generating multiple variates in a single call, an essential feature for simulations, and ensuring results are reproducible using a fixed seed.

import numpy as np

seed = 42
rng = np.random.default_rng(seed)  # Creating a seeded rng for reproducibility
colors = [25, 15, 20]
samples = rng.multivariate_hypergeometric(colors, nsample=12, size=5)
print(samples)

Output:

[[5 3 4]
 [6 3 3]
 [5 3 4]
 [6 1 5]
 [6 5 1]]

With the same seed (and the same NumPy version), this code produces the same set of draws on every run. Each of the 5 rows is a separate draw of 12 marbles, so every row sums to 12.
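One way to confirm the reproducibility claim is to create two generators from the same seed and compare their results. This is a minimal sketch that reuses the seed and parameters of the example above.

```python
import numpy as np

colors = [25, 15, 20]

# Two independent generators created from the same seed
first = np.random.default_rng(42).multivariate_hypergeometric(colors, nsample=12, size=5)
second = np.random.default_rng(42).multivariate_hypergeometric(colors, nsample=12, size=5)

print(np.array_equal(first, second))  # True: same seed and same call sequence
print(first.sum(axis=1))              # [12 12 12 12 12]: each row is one draw of 12
```

Note that the two generators match only because they make the same calls in the same order; drawing anything else from one of them first would shift its stream.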

Example 3: Advanced Applications

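Now let’s explore a more advanced scenario where the composition of the population differs from one draw to the next, a common situation in dynamic systems. Each adjustment in the code below is applied to the initial counts on its own; the adjustments are not accumulated.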

import numpy as np

seed = 42
rng = np.random.default_rng(seed)  # Creating a seeded rng for reproducibility

initial_colors = [40, 30, 50]  # initial state
adjustments = [[-5, +10, -5], [+10, -15, +5]]  # adjustments over time

# Simulate adjustments
for adjustment in adjustments:
    colors = np.add(initial_colors, adjustment)
    samples = rng.multivariate_hypergeometric(colors, nsample=20)
    print("Adjusted Colors: ", colors)
    print("Sample: ", samples)

Output:

Adjusted Colors:  [35 40 45]
Sample:  [5 8 7]
Adjusted Colors:  [50 15 55]
Sample:  [10  1  9]

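This shows how the population’s composition affects the outcome: in the second scenario only 15 of the 120 items are blue, and the sample of 20 contains just 1 blue item. The same reasoning applies to managing inventories, ecosystems, or any system whose population mix changes over time.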
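The loop above applies each adjustment to the initial counts independently. For a population that evolves cumulatively, one option is to feed the result of each step into the next. The sketch below assumes that drawn items are removed from the population between rounds; that rule, the number of rounds, and the variable names are assumptions made for illustration and are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(42)

remaining = np.array([40, 30, 50])  # same initial counts as the example above
nsample = 20

for step in range(3):
    # Draw from whatever is currently left in the population
    drawn = rng.multivariate_hypergeometric(remaining, nsample)
    # Drawn items leave the population, so later rounds see fewer items
    remaining = remaining - drawn
    print(f"step {step}: drawn={drawn}, remaining={remaining}")
```

Because a draw can never take more items of a color than are available, the remaining counts stay non-negative; the loop only needs to stop before nsample exceeds the total left.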

Conclusion

The multivariate hypergeometric distribution models selections without replacement from a heterogeneous population, and NumPy’s Generator.multivariate_hypergeometric makes it straightforward to sample from. The examples in this tutorial progressed from a single draw, to reproducible batches of draws, to draws from populations with different compositions.