How to Handle Large Arrays with NumPy’s Memory Mapping

Updated: January 23, 2024 By: Guest Contributor

Introduction

Dealing with large datasets is a common challenge in data analysis and machine learning. Holding the entire dataset in memory can be impractical or impossible due to hardware limitations. This is where memory mapping comes into play, and NumPy, a fundamental package for scientific computing in Python, offers a feature known as memory-mapped arrays that enables you to work with arrays too large for your system’s memory.

This tutorial will cover what memory mapping is, how NumPy implements it, and demonstrate through examples how you can utilize this feature for efficient data processing.

Understanding Memory Mapping

Memory mapping maps a file on disk (or a portion of one) to a region of virtual memory. You can then access and process the data using the same syntax as if it were all held in RAM, without actually loading it all into memory at once. The operating system transparently pages in the relevant portions of the file as they are accessed and writes any changes back to the file on disk.

Getting Started with NumPy’s Memory Mapping

Before we start manipulating large datasets with memory-mapped arrays in NumPy, you need to have NumPy installed. If you haven’t already installed it, you can do so with the following command:

pip install numpy

Let’s start with a basic example of memory mapping in NumPy. First, we will create a memory-mapped array from scratch and write some data to it. After that, we’ll read this data back.

import numpy as np

# Define the shape and the data type of the array
shape = (1000, 1000)
dtype = np.int64

# Create a memory-mapped array with zeros
fp = np.memmap('mmaped.dat', dtype=dtype, mode='w+', shape=shape)

# Assigning values to a segment
fp[0:100, :] = np.random.randint(0, 100, (100, 1000))

# Flushing memory changes to disk
fp.flush()

# Deallocating the memory-mapped object
del fp

# Open the memory-mapped file in read-only mode
new_fp = np.memmap('mmaped.dat', dtype=dtype, mode='r', shape=shape)

# Reading values
print(new_fp[0:100, :])

In the code above, we first import NumPy and define the shape and data type for our array. We then create a memory-mapped file in ‘w+’ mode, which creates the file (overwriting it if it already exists) and opens it for both reading and writing. We assign random values to the first 100 rows and call fp.flush() to ensure all changes are written to disk before we delete the memory-mapped array object. Lastly, we reopen the file in read-only mode to retrieve and print the data we wrote earlier.
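
Besides ‘w+’, np.memmap accepts three other modes: ‘r’ (read-only), ‘r+’ (read and write to an existing file), and ‘c’ (copy-on-write, where assignments affect memory but are never saved to disk). As a small sketch of copy-on-write, reusing the mmaped.dat file created above:

import numpy as np

# Copy-on-write ('c') mode: assignments modify the in-memory pages only;
# the file on disk is never changed
cow_fp = np.memmap('mmaped.dat', dtype=np.int64, mode='c', shape=(1000, 1000))
cow_fp[0, 0] = -1

# Re-opening the file read-only shows the on-disk data is untouched
check = np.memmap('mmaped.dat', dtype=np.int64, mode='r', shape=(1000, 1000))
print(cow_fp[0, 0], check[0, 0])  # -1 versus the original on-disk value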

Processing Parts of Large Arrays

One of the biggest advantages of memory-mapped arrays is that you can manipulate parts of large arrays without worrying that your system will run out of memory. Here we’ll see how to process a segment of a large array efficiently.

import numpy as np

# Shape must be a tuple of ints; 100,000 x 10,000 float64 values
# is about 8 GB, far more than we would want to hold in RAM
large_shape = (100_000, 10_000)
dtype = np.float64

# Create the file once; 'w+' allocates it on disk
# (as a sparse file on most filesystems)
np.memmap('large_mmaped.dat', dtype=dtype, mode='w+', shape=large_shape).flush()

# Open the existing memory-mapped array for reading and writing
large_fp = np.memmap('large_mmaped.dat', dtype=dtype, mode='r+', shape=large_shape)

# Compute the mean of a part of the array; only the pages
# backing this slice are actually read from disk
mean_value = large_fp[50_000:50_100, 5_000:5_100].mean()
print('Mean value:', mean_value)

In the above code, we first create the file with ‘w+’ (note that the shape passed to np.memmap must be a tuple of integers), then reopen it in ‘r+’ mode, which opens an existing file for reading and writing without recreating it. The file stores raw values with no header, so you must supply the same dtype and shape it was created with. We then select a slice of the memory-mapped array and compute its mean. Because we’re using memory mapping, this operation won’t consume memory for the entire array; it only uses what’s needed for the selected portion.
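
The same idea extends to whole-array statistics: rather than calling mean() on the full array, you can walk over it one block of rows at a time so that only one block is ever resident in memory. A minimal sketch, assuming the large_mmaped.dat file created above (the chunk size is an illustrative choice you would tune to your RAM):

import numpy as np

large_fp = np.memmap('large_mmaped.dat', dtype=np.float64, mode='r',
                     shape=(100_000, 10_000))

# Accumulate a global mean one block of rows at a time
chunk_rows = 10_000
total = 0.0
count = 0
for start in range(0, large_fp.shape[0], chunk_rows):
    block = large_fp[start:start + chunk_rows]  # only this block is paged in
    total += block.sum(dtype=np.float64)
    count += block.size

print('Global mean:', total / count)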

Advanced Operations

Memory-mapped arrays support most of the operations you can perform on regular NumPy arrays. Here’s a more advanced example that filters part of a large array based on a condition and then performs a computation on the filtered data.

import numpy as np

# Reopen the large memory-mapped array from the previous example
# (dtype and shape must match how the file was created)
large_fp = np.memmap('large_mmaped.dat', dtype=np.float64, mode='r+',
                     shape=(100_000, 10_000))

# Apply a condition to the array and sum the selected values
filtered_sum = large_fp[large_fp > 0.5].sum()
print('Filtered sum:', filtered_sum)

In this example, we apply a boolean condition to the memory-mapped array, selecting elements that are greater than 0.5, and then sum these elements. One caution: the comparison large_fp > 0.5 materializes a full-size boolean mask in RAM (one byte per element), and the fancy indexing copies the selected values into a new in-memory array. For arrays that genuinely exceed your memory, apply the condition chunk by chunk instead, as sketched below.
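
Here is a minimal chunked variant of the same computation (again assuming the large_mmaped.dat file from above, with the chunk size an illustrative choice you would tune to your RAM):

import numpy as np

large_fp = np.memmap('large_mmaped.dat', dtype=np.float64, mode='r',
                     shape=(100_000, 10_000))

# Filter and sum chunk by chunk so the boolean mask only ever
# covers a single block of rows
chunk_rows = 10_000
filtered_sum = 0.0
for start in range(0, large_fp.shape[0], chunk_rows):
    block = large_fp[start:start + chunk_rows]
    filtered_sum += block[block > 0.5].sum()

print('Filtered sum:', filtered_sum)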

Performance Considerations

It is important to note that, although memory-mapping can significantly reduce the memory footprint, it’s not without its trade-offs. Disk I/O tends to be much slower than memory I/O, so you should expect some operations to take longer when performed on a memory-mapped array compared to an in-memory array, especially if you are working with very large datasets that cause frequent reading from and writing to disk.
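
Access pattern matters as well. With NumPy’s default C (row-major) order, a block of rows is contiguous on disk, while a block of columns touches pages scattered across the whole file. A rough, unscientific way to observe this, assuming the large_mmaped.dat file from the earlier examples (results will vary with your filesystem and OS cache):

import time
import numpy as np

fp = np.memmap('large_mmaped.dat', dtype=np.float64, mode='r',
               shape=(100_000, 10_000))

# Both slices hold 10 million elements, but their disk layouts differ:
# rows are contiguous (C order), columns are strided across the whole file
t0 = time.perf_counter()
fp[:1_000, :].sum()   # one contiguous run of the file
t1 = time.perf_counter()
fp[:, :100].sum()     # scattered reads across every row
t2 = time.perf_counter()
print(f'row block: {t1 - t0:.3f}s, column block: {t2 - t1:.3f}s')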

Conclusion

NumPy’s memory mapping provides a powerful tool for working with datasets that are too large to fit into memory. By using this feature, we can manipulate these datasets almost as if they were ordinary in-memory arrays, but with a much smaller memory footprint. This enables more complex analyses and models to run on standard hardware, making the most of your system’s capabilities.