NumPy: How to extract N random rows from a large array (4 ways)

Updated: March 1, 2024 By: Guest Contributor

Overview

NumPy is a fundamental package for scientific computing in Python. It is widely used for its powerful N-dimensional array objects, and it offers advanced capabilities like broadcasting, vectorized operations, linear algebra routines, and more. This tutorial focuses on how to extract N random rows from a large NumPy array. We’ll explore four different ways to achieve this, starting from basic techniques and moving towards more advanced methods.

Approach #1 – Using np.random.choice

The most straightforward way to extract random rows from a NumPy array is the np.random.choice function. It draws a random sample from a given 1-D array or, when passed an integer n, from np.arange(n). To select random rows from a 2-D array, you can draw a random sample of row indices and then use these indices to index your array.

import numpy as np

data = np.random.rand(100, 5)  # Generate a 100x5 array of random numbers
random_indices = np.random.choice(data.shape[0], 10, replace=False)
random_rows = data[random_indices, :]

print(random_rows)

Output (varies due to randomness):

[[0.85464523 0.72358908 0.07677774 0.31245756 0.9179609 ]
 [0.70049004 0.17365916 0.05328865 0.68260541 0.39491402]
 [0.07040867 0.86061469 0.43552088 0.05446339 0.07341297]
 [0.16008213 0.5564603  0.05704451 0.27696696 0.96196632]
 [0.91583693 0.74333554 0.16674873 0.29544847 0.85152656]
 [0.23933403 0.88929775 0.63005529 0.33001907 0.63328781]
 [0.75275824 0.27540506 0.24707123 0.42924852 0.76544332]
 [0.93685099 0.74991963 0.54217951 0.95902067 0.09074765]
 [0.1014044  0.5527204  0.73468465 0.6445147  0.85897175]
 [0.37903288 0.75188152 0.03739945 0.75080648 0.42798774]]

This code snippet generates a 100×5 array filled with random floating-point numbers. It then selects 10 unique rows at random, extracting these rows into random_rows. Here, setting replace=False ensures that each row is selected only once.
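If you need the same rows on every run, the same idea can be sketched with NumPy's Generator interface and a fixed seed. The helper name sample_rows and the seed value 42 are illustrative assumptions, not part of the example above.

```python
import numpy as np

def sample_rows(data, n, seed=None):
    """Return n distinct rows of a 2-D array, chosen uniformly at random."""
    # A fixed seed makes the draw repeatable; seed=None gives a fresh draw each time
    rng = np.random.default_rng(seed)
    # Draw n distinct row indices, then use them to index the array
    idx = rng.choice(data.shape[0], size=n, replace=False)
    return data[idx]

# Rows built with np.arange so each one is easy to identify (illustrative data)
data = np.arange(500).reshape(100, 5)
rows_a = sample_rows(data, 10, seed=42)
rows_b = sample_rows(data, 10, seed=42)

print(rows_a.shape)
print(np.array_equal(rows_a, rows_b))  # same seed, same rows
```

Calling the helper without a seed behaves like the original snippet and returns a different sample each time.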

Approach #2 – Using np.random.permutation

An alternative approach is to randomly permute the array indices using np.random.permutation and then select the first N rows from this permutation. This is convenient when you want a shuffled ordering of all row indices but only need the first N of them.

import numpy as np

data = np.random.rand(100, 5)
permutation = np.random.permutation(data.shape[0])
random_rows = data[permutation[:10], :]

print(random_rows)

Like the first method, this approach guarantees that the selected rows are unique, because a permutation contains each index exactly once. Every subset of N rows is equally likely to be chosen. Note that it permutes all row indices even when N is small, so its cost grows with the number of rows rather than with N.
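The uniqueness property can be checked with a short sketch. The array built with np.arange is an assumption made here so that each row is identifiable; any 2-D array would do.

```python
import numpy as np

# Illustrative data: row i starts with the value 5 * i
data = np.arange(500).reshape(100, 5)

permutation = np.random.permutation(data.shape[0])
first_ten = permutation[:10]
random_rows = data[first_ten]

# A permutation contains every index from 0 to 99 exactly once,
# so any slice of it holds distinct indices
print(np.array_equal(np.sort(permutation), np.arange(100)))
print(len(np.unique(first_ten)))
```

Both checks hold on every run, regardless of which rows happen to be drawn.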

Approach #3 – Using Random Sampling Along with Boolean Indexing

In some cases, you might want a method that allows for more complex selection criteria. By combining random sampling with boolean indexing, you can achieve this level of control. For example, suppose you only want to select rows where the sum of the elements is greater than a certain threshold.

import numpy as np

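data = np.random.rand(100, 5)
indices = np.arange(data.shape[0])
# Step 1: keep only the indices of rows whose elements sum to more than 2.5
eligible = indices[data.sum(axis=1) > 2.5]
# Step 2: draw 10 distinct indices from the eligible ones
selected_rows = np.random.choice(eligible, 10, replace=False)
random_rows = data[selected_rows, :]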

print(random_rows)

This technique involves two steps: first, filtering the array to get the indices that meet your criteria, and then selecting random indices from this filtered list. This method offers the flexibility to apply more sophisticated selection criteria while still randomly sampling rows.
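One caveat: with replace=False, np.random.choice raises a ValueError when fewer rows pass the filter than you ask for. A minimal sketch of one way to guard against that follows. The helper name sample_matching_rows and the decision to return fewer rows instead of failing are assumptions made for illustration.

```python
import numpy as np

def sample_matching_rows(data, mask, n):
    """Return up to n distinct random rows of data where mask is True."""
    candidates = np.flatnonzero(mask)   # indices of rows that pass the filter
    k = min(n, candidates.size)         # never request more rows than exist
    if k == 0:
        return data[:0]                 # empty result with the same column count
    chosen = np.random.choice(candidates, k, replace=False)
    return data[chosen]

data = np.random.rand(100, 5)
rows = sample_matching_rows(data, data.sum(axis=1) > 2.5, 10)
print(rows.shape)

# Only two rows match here, so only two are returned
small = np.arange(20.0).reshape(4, 5)
few = sample_matching_rows(small, np.array([True, False, True, False]), 10)
print(few.shape)
```

Whether to return a shorter sample or raise an error depends on your analysis; the guard above simply makes the choice explicit.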

Approach #4 – Using np.random.shuffle (in-place shuffle)

Lastly, if you’re working with data where the order of elements can be changed without impacting the analysis (i.e., the dataset does NOT need to remain in its original order), you could perform an in-place shuffle of the rows and then simply select the first N rows. This method is memory-efficient because it requires no additional storage for permutations or sampled indices, although it still has to move every row of the array.

import numpy as np

data = np.random.rand(100, 5)
np.random.shuffle(data)
random_rows = data[:10, :]

print(random_rows)

This method directly shuffles the original array, so be cautious using it if the order of your data is meaningful or must be preserved for subsequent operations.
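If the original order does matter, one option is to shuffle a copy instead. The sketch below assumes you can afford the memory for a second array, which gives up the storage advantage of the in-place version; the np.arange data is illustrative.

```python
import numpy as np

# Illustrative data: row i starts with the value 5 * i
data = np.arange(500).reshape(100, 5)

shuffled = data.copy()        # costs one full copy of the array
np.random.shuffle(shuffled)   # reorders the rows of the copy in place
random_rows = shuffled[:10]   # a view into the shuffled copy

# The original array keeps its row order
print(np.array_equal(data[:, 0], np.arange(0, 500, 5)))
```

np.random.shuffle only reorders along the first axis, so each row stays intact and only the row order changes.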

Conclusion

Extracting random rows from a large NumPy array can be achieved in several ways, each with its own use cases and advantages. The methods covered here range from straightforward random index selection with np.random.choice to filtered sampling with boolean indexing and in-place shuffling. The right choice depends largely on whether the original row order must be preserved and on the complexity of the selection criteria you wish to apply. With these techniques, you can randomly sample large datasets efficiently for a wide range of data science and machine learning applications.