How to Optimize NumPy Code Using Just-In-Time Compilation

Updated: January 23, 2024 By: Guest Contributor

Overview

NumPy is a cornerstone of scientific computing in Python, but its performance can fall well short of equivalent C code, especially for algorithms that can't be expressed as simple array operations. One powerful way to close that gap is Just-In-Time (JIT) compilation, which compiles Python code into machine code at runtime. In this tutorial, I'll walk you through optimizing NumPy code with Numba, a JIT compiler that translates a subset of Python and NumPy code into fast machine code.

What is JIT Compilation?

Just-In-Time (JIT) compilation is a dynamic compilation technique that improves a program's execution speed by compiling code at runtime rather than ahead of time. Because compilation happens while the program runs, you keep Python's dynamic typing while hot numerical code executes at native speed, which makes JIT especially attractive in the context of Python and NumPy.

Getting Started with Numba

First things first, to use JIT with NumPy, we need to install Numba. You can easily do this with pip:

pip install numba
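
If you use the Anaconda distribution, Numba is also available through conda:

conda install numba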

Once installed, you can start by importing Numba and applying the @numba.jit decorator.

import numba
import numpy as np

@numba.jit  # compiles this function to machine code on its first call
def sum_array(arr):
    result = 0.0  # float accumulator, matching the float64 elements of arr
    for item in arr:
        result += item
    return result

arr = np.arange(1e6)  # one million float64 values: 0.0, 1.0, ..., 999999.0
print(sum_array(arr))

This simple example uses the @numba.jit decorator to compile a function that sums the elements of a NumPy array. Note that the first call triggers compilation; on subsequent calls, the cached machine code runs and you should see a significant speedup over the equivalent pure-Python loop.
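
To see the difference concretely, you can time the compiled function against the original Python version. Here is a rough sketch (timings vary by machine) that reuses sum_array and arr from above; the .py_func attribute on a Numba-compiled function gives you back the original, uncompiled Python function:

import time

sum_array(arr)  # warm-up call: triggers compilation, so we exclude it from timing

start = time.perf_counter()
sum_array(arr)  # runs the compiled machine code
print(f"JIT-compiled: {time.perf_counter() - start:.6f} s")

start = time.perf_counter()
sum_array.py_func(arr)  # the original pure-Python loop, for comparison
print(f"Pure Python:  {time.perf_counter() - start:.6f} s")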

Understanding Numba’s nopython mode

Numba has two compilation modes: object mode and nopython mode. nopython mode is stricter but much faster, as it avoids Python's C API entirely. To enforce this mode, pass the nopython=True parameter to the jit decorator.
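
As a convenience, Numba also provides numba.njit, which is simply shorthand for numba.jit(nopython=True):

import numba

@numba.njit  # equivalent to @numba.jit(nopython=True)
def double(x):
    return x * 2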

Example: Matrix Multiplication in nopython mode

import numba
import numpy as np

@numba.jit(nopython=True)
def matrix_multiply(A, B):
    rows_A, cols_A = A.shape
    rows_B, cols_B = B.shape
    result = np.zeros((rows_A, cols_B))
    # Classic triple loop: under nopython mode this runs as compiled machine code
    for i in range(rows_A):
        for j in range(cols_B):
            for k in range(cols_A):
                result[i, j] += A[i, k] * B[k, j]
    return result

A = np.random.rand(100, 100)
B = np.random.rand(100, 100)
print(matrix_multiply(A, B))

The above example implements matrix multiplication over NumPy arrays. With nopython mode enabled, every operation inside the function must be supported by Numba's implementations of Python and NumPy features; in exchange, the compiled code avoids Python object overhead and approaches C-like performance.
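
As a quick sanity check, you can compare the result against NumPy's built-in matrix product (a small sketch reusing A and B from above):

# Both should agree up to floating-point rounding
print(np.allclose(matrix_multiply(A, B), A @ B))  # True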

Using Numba's guvectorize Decorator

For operations that work over the axes of a NumPy array (generalized ufuncs), Numba provides the guvectorize decorator: you write a function that operates on array slices and declare the element types and shape signature of the resulting gufunc. Here's an example:

import numba
import numpy as np

@numba.guvectorize([(numba.float64[:], numba.float64[:])], '(n)->(n)', nopython=True)
def cumulative_sum(vec, out):
    # 'out' is supplied by the gufunc machinery; write results into it in place
    total = 0.0
    for i in range(vec.shape[0]):
        total += vec[i]
        out[i] = total

vec = np.array([1.0, 2.0, 3.0, 4.0])
out = np.empty_like(vec)
cumulative_sum(vec, out)
print(out)

With the guvectorize decorator, the layout string ('(n)->(n)' here) describes how input dimensions map to output dimensions, while the type list declares the element types. In this example, we've implemented a cumulative sum over a one-dimensional array, writing an array of the same shape into out.
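
A nice property of generalized ufuncs is that they broadcast over any extra leading dimensions automatically. For instance, applying cumulative_sum to a 2-D array computes one cumulative sum per row (a small sketch reusing the function above):

mat = np.arange(12, dtype=np.float64).reshape(3, 4)
print(cumulative_sum(mat))  # the '(n)->(n)' signature is applied along the last axis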

Optimizing Reductions with Numba

Reduction operations collapse the elements of an array into a single value. Optimizing reductions with Numba takes some care, since parallel accumulation must remain thread-safe. Here's a parallel sum reduction with Numba:

import numba
import numpy as np

@numba.jit(nopython=True, parallel=True)  # parallel=True activates prange
def sum_reduction(arr):
    total = 0.0
    for i in numba.prange(len(arr)):  # iterations may run on different threads
        total += arr[i]  # Numba turns this accumulation into a thread-safe reduction
    return total

arr = np.arange(1e6)
print(sum_reduction(arr))

In this case, the sum is parallelized with numba.prange, which tells Numba the loop iterations can run concurrently; note that this only takes effect when parallel=True is passed to the decorator. Numba recognizes the += accumulation as a reduction and combines the per-thread partial sums, leading to faster computation on multicore processors.
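
If you want to control how many threads the parallel loop uses, Numba exposes numba.set_num_threads (a small sketch reusing sum_reduction and arr from above; by default Numba uses all available cores):

numba.set_num_threads(4)  # cap the parallel reduction at 4 threads
print(sum_reduction(arr))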

Advanced Techniques: Using Numba’s Caching and Typing

To cut startup overhead further, you can give Numba an explicit signature (so compilation happens eagerly, up front) and enable caching of the compiled function so future runs skip compilation entirely:

import numba
import numpy as np

@numba.jit('float64(float64[:])', nopython=True, cache=True)
def sum_with_signature_and_caching(arr):
    total = 0.0
    for value in arr:
        total += value
    return total

Declaring the signature up front tells Numba exactly what types to expect, so the function is compiled eagerly when it's defined rather than lazily on its first call, and cache=True writes the compiled machine code to an on-disk cache so subsequent runs of your program load it instead of recompiling.
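
Calling the function works as usual; the difference is that the next run of the program loads the cached binary rather than compiling again (a small sketch reusing the function above):

arr = np.random.rand(1_000_000)
print(sum_with_signature_and_caching(arr))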

Conclusion

In this tutorial, we've seen how to accelerate NumPy computations with JIT compilation using Numba. From simple loops to complex numerical operations, just-in-time compilation can bring your Python code closer to the metal, unlocking greater performance, especially for computationally heavy tasks.