Introduction
The NumPy library is an essential tool in the Python data science stack, providing support for arrays, matrices, and high-level mathematical functions. Randomness is used in a variety of computational tasks, including simulations, randomized algorithms, and random sampling of data. However, reproducing experiments or tests requires a predictable form of randomness, a paradox that is resolved by using ‘random seeds’. In this tutorial, we will explore the concept of a random seed and how to work with it through the NumPy library.
Understanding Randomness and Seeds
Randomness in programming comes from pseudo-random number generators (PRNGs): algorithms that produce sequences of numbers that look random but are in fact fully deterministic. A ‘seed’ sets the starting point of the sequence. Using the same seed therefore makes the generator output the same sequence of ‘random’ numbers every time, which is what makes results repeatable.
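To make the idea of a deterministic sequence concrete, here is a minimal sketch of a toy generator. It is a linear congruential generator with commonly cited textbook constants, chosen purely for illustration; it is not the algorithm NumPy uses, and the name toy_lcg is hypothetical.

```python
# Toy linear congruential generator (LCG), for illustration only.
# The constants are commonly cited textbook values; NumPy does NOT use this algorithm.
def toy_lcg(seed, count):
    """Return `count` pseudo-random floats in [0, 1), starting from `seed`."""
    modulus = 2**32
    multiplier = 1664525
    increment = 1013904223
    state = seed % modulus
    values = []
    for _ in range(count):
        # Each new state depends only on the previous state,
        # so the whole sequence is fixed once the seed is chosen.
        state = (multiplier * state + increment) % modulus
        values.append(state / modulus)
    return values

first_run = toy_lcg(seed=123, count=3)
second_run = toy_lcg(seed=123, count=3)
other_seed = toy_lcg(seed=124, count=3)

print(first_run == second_run)  # True: same seed, same sequence
print(first_run == other_seed)  # False: different seed, different sequence
```

Real generators are far more sophisticated, but the principle is the same: the seed fixes the internal state, and every later value follows from that state.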
Example 1: Basic Random Seed Usage
import numpy as np
# Set the random seed
np.random.seed(0)
# Generate five random numbers
random_numbers = np.random.random(5)
print(random_numbers)
Output:
[0.5488135  0.71518937 0.60276338 0.54488318 0.4236548 ]
Using a seed value of 0 consistently reproduces the same array every time the code is executed.
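One way to check this claim without comparing printed numbers by eye is to draw twice with the same seed and compare the arrays. The sketch below also shows that, without re-seeding, the generator simply continues along its sequence. The variable names are illustrative.

```python
import numpy as np

np.random.seed(0)
first = np.random.random(5)

# Without re-seeding, the generator continues from where it stopped.
continued = np.random.random(5)

# Re-seeding with the same value rewinds to the start of the sequence.
np.random.seed(0)
second = np.random.random(5)

print(np.array_equal(first, second))     # True
print(np.array_equal(first, continued))  # False
```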
Example 2: Seeding and Data Shuffling
import numpy as np
# Set the random seed
np.random.seed(42)
# Create an array from 0 to 9
data = np.arange(10)
print('Original data:', data)
# Shuffle the data
np.random.shuffle(data)
print('Shuffled data:', data)
Output:
Original data: [0 1 2 3 4 5 6 7 8 9]
Shuffled data: [8 1 5 0 7 2 9 4 3 6]
Even when shuffling the data in the array, the output remains consistent across runs when seeded with the same value.
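A related pattern, sketched here under the assumption that you have two aligned arrays (the features and labels below are made up for illustration), is to generate one seeded permutation of indices and apply it to both arrays. Unlike np.random.shuffle, np.random.permutation returns a new array and leaves the originals unchanged.

```python
import numpy as np

# Hypothetical aligned arrays: labels[i] belongs to features[i].
features = np.arange(10) * 10
labels = np.arange(10)

np.random.seed(42)
order = np.random.permutation(len(features))  # seeded, shuffled index array

shuffled_features = features[order]
shuffled_labels = labels[order]

# The pairs stay aligned because both arrays use the same index order.
print(np.array_equal(shuffled_features, shuffled_labels * 10))  # True
```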
Random Sampling and Distributions
NumPy offers various functions to generate random samples according to different statistical distributions. Seeding can be particularly useful here to ensure reproducible research or simulations.
Example 3: Random Sampling from a Normal Distribution
import numpy as np
# Set the seed
np.random.seed(7)
# Generate random samples from a normal distribution
samples = np.random.normal(loc=0.0, scale=1.0, size=1000)
# Check the first five samples
print(samples[:5])
Output:
[ 1.6905257  -0.46593737  0.03282016  0.40751628 -0.78892303]
This code produces the same normal distribution sample whenever executed with the same seed.
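The same holds for other distributions. As a minimal sketch, the helper below (draw_samples is an illustrative name, and the distribution parameters are arbitrary) seeds the global generator and then draws from uniform, binomial, and Poisson distributions; two calls with the same seed should agree element by element.

```python
import numpy as np

def draw_samples(seed):
    """Seed the global generator, then draw small samples from three distributions."""
    np.random.seed(seed)
    return {
        "uniform": np.random.uniform(low=0.0, high=1.0, size=5),
        "binomial": np.random.binomial(n=10, p=0.5, size=5),
        "poisson": np.random.poisson(lam=3.0, size=5),
    }

run_a = draw_samples(7)
run_b = draw_samples(7)

# Each distribution's sample is identical across the two runs.
for name in run_a:
    print(name, np.array_equal(run_a[name], run_b[name]))
```

Note that the order of the calls matters: drawing the same distributions in a different order after seeding would consume the sequence differently and give different samples.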
Reproducibility Across Sessions and Systems
Another aspect of random seeds is their importance in maintaining consistency not just in a single environment, but across different systems or between separate computing sessions.
Example 4: Multi-dimensional Array Generation
import numpy as np
# Set the seed
np.random.seed(11)
# Generate a 3x3 matrix of random integers from 0 to 9 (the upper bound, 10, is exclusive)
matrix = np.random.randint(0, 10, (3, 3))
print(matrix)
Output:
[[9 0 1]
 [8 8 3]
 [9 8 7]]
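This example should generate the same 3×3 matrix across different Python sessions and devices, because NumPy aims to keep the sequence produced by np.random.seed for a given seed stable between versions. This helps ensure the reproducibility of an experiment’s conditions or results.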
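Besides re-seeding, the global generator's state can also be captured and restored with np.random.get_state and np.random.set_state. The sketch below does this within a single session; persisting the state between sessions would additionally require saving it to disk, which is not shown here.

```python
import numpy as np

np.random.seed(11)
np.random.randint(0, 10, (3, 3))      # advance the generator past the first draw

saved_state = np.random.get_state()   # snapshot of the global generator
next_draw = np.random.randint(0, 10, 5)

np.random.set_state(saved_state)      # rewind to the snapshot
replayed_draw = np.random.randint(0, 10, 5)

print(np.array_equal(next_draw, replayed_draw))  # True
```

This is useful when a computation needs to be resumed or replayed from a point partway through the sequence rather than from the seed.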
Example 5: Using Random Seed with a Random State Object
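NumPy also allows for the creation of a separate random number generator through its RandomState class. This is particularly handy when dealing with multiple threads or processes, because each one can own its generator instead of sharing the global state. (In NumPy 1.17 and later, np.random.default_rng is the recommended way to create such generators; RandomState remains available for legacy code.)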
import numpy as np
# Create a new RandomState object with a given seed
rng = np.random.RandomState(29)
# Generate random numbers using the RandomState object
values = rng.standard_normal(10)
print(values)
Output:
[ 0.4274952 0.17499346 -0.91231898 -0.43256247 -1.12280684 0.42007918
0.57192801 -0.41124656 0.6670449 -1.49266338]
The global NumPy random state is left untouched: the RandomState object draws from its own seeded stream, so its sequences are reproducible and isolated from the rest of the program.
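The isolation can be checked directly. In this sketch, the global generator is seeded and sampled twice, once with and once without an intervening draw from a RandomState object; if the object is truly isolated, the two global samples should match. Variable names are illustrative.

```python
import numpy as np

# Reference: what the global generator yields right after seeding.
np.random.seed(0)
expected_global = np.random.random(3)

# Same seed again, but draw from a separate RandomState in between.
np.random.seed(0)
rng = np.random.RandomState(29)
rng.standard_normal(10)                # advances rng only
actual_global = np.random.random(3)

print(np.array_equal(expected_global, actual_global))  # True: global state unaffected

# Two RandomState objects with the same seed yield the same sequence.
rng_a = np.random.RandomState(29)
rng_b = np.random.RandomState(29)
print(np.array_equal(rng_a.standard_normal(4), rng_b.standard_normal(4)))  # True
```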
Advanced Usage and Best Practices
When writing complex programs or conducting research, it is essential to note some best practices regarding random seeds.
Seeding in Parallel Computations
Parallel computing complicates random number generation. If every process starts from the same seed, all processes produce identical sequences, which can bias results, so each process needs its own distinct seed. NumPy’s RandomState class can be used to give each process its own generator, seeded with a different value.
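As a minimal sketch of this idea, the code below derives one seed per worker from a base seed plus the worker's index. BASE_SEED and worker_task are hypothetical names, and the workers are simulated with a plain loop here; in a real program each call would run in its own process or thread.

```python
import numpy as np

BASE_SEED = 1234  # hypothetical base seed for the whole run

def worker_task(worker_id, n_samples=5):
    """Draw samples using a generator owned by this worker alone."""
    rng = np.random.RandomState(BASE_SEED + worker_id)
    return rng.random_sample(n_samples)

# Simulate four workers sequentially for illustration.
results = [worker_task(worker_id) for worker_id in range(4)]

# Different workers see different sequences...
print(np.array_equal(results[0], results[1]))    # False

# ...but re-running a given worker reproduces its sequence.
print(np.array_equal(results[2], worker_task(2)))  # True
```

Offsetting a base seed is a simple scheme that keeps each worker reproducible, but distinct seeds alone do not guarantee statistically independent streams. Newer NumPy versions provide np.random.SeedSequence, which is designed for deriving seeds for parallel workers.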
Changing Seeds
It is sometimes necessary to change the random seed within a program, for example to check how sensitive an analysis is to a particular random sequence. This can be done by calling np.random.seed with a different value at the desired point in the code.
Example 6: Changing Seeds Mid-Execution
import numpy as np
# Set the seed
np.random.seed(33)
# Generate initial random numbers
initial_values = np.random.rand(2)
print('Initial values:', initial_values)
# Set a new seed
np.random.seed(44)
# Generate new random numbers post re-seeding
new_values = np.random.rand(2)
print('New values:', new_values)
Output:
Initial values: [0.24851013 0.44997542]
New values: [0.28730852 0.17342911]
Each seed still yields a fully reproducible sequence, but the two seeds yield different sequences, which makes it possible to compare results under different random draws.
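The same idea extends to a loop over several seeds. The sketch below estimates the mean of uniform samples under three seeds to show how much a result varies from one random sequence to another; estimate_mean, the seed list, and the sample size are illustrative choices.

```python
import numpy as np

def estimate_mean(seed, n_samples=1000):
    """Seed the global generator and return the mean of uniform samples on [0, 1)."""
    np.random.seed(seed)
    return np.random.rand(n_samples).mean()

seeds = [33, 44, 55]
estimates = {seed: estimate_mean(seed) for seed in seeds}

# Each estimate is reproducible for its seed, but differs between seeds.
for seed, value in estimates.items():
    print(f"seed={seed}: mean={value:.4f}")
```

If a conclusion changes noticeably from seed to seed, that is a sign it depends on the particular random draw rather than on the underlying data or method.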
Conclusion
In scientific and data-driven work, seeded random number generation is invaluable: it makes computations that look stochastic exactly reproducible. Understanding how seeding works in NumPy’s random module paves the way for reliable, reproducible research and easier debugging of algorithms.