How to Implement Gradient Descent Optimization with NumPy

Updated: January 24, 2024 By: Guest Contributor

Introduction

Gradient Descent is a fundamental optimization algorithm used in machine learning to minimize loss functions by iteratively moving towards the minimum of a function. This tutorial provides a comprehensive guide on implementing Gradient Descent using NumPy, a powerful library for numerical computing in Python.

Understanding Gradient Descent

Before delving into the code, let’s cover some basics. Gradient Descent is based on the observation that if a multi-variable function F(x) is defined and differentiable in a neighborhood of a point a, then F(x) decreases fastest if one goes from a in the direction of the negative gradient of F at a, -∇F(a). Mathematically, we describe the update equation as:

theta = theta - alpha * gradient_of_loss_function

Where:

  • theta is the parameter vector,
  • alpha is the learning rate, a tuning parameter that determines the step size taken at each iteration while moving toward a minimum of the loss function,
  • gradient_of_loss_function is the gradient of the loss function with respect to the parameter vector theta.
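
To make the update rule concrete, here is a minimal, self-contained sketch that minimizes the one-dimensional function f(x) = x**2, whose gradient is 2*x (the function and starting point are illustrative choices, not part of the regression example that follows):

def gradient(theta):
    # Gradient of f(theta) = theta**2
    return 2 * theta

theta = 5.0   # arbitrary starting point
alpha = 0.1   # learning rate

for _ in range(100):
    theta = theta - alpha * gradient(theta)

print(theta)  # approaches 0, the minimizer of f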

Setting Up the Problem

For our example, let’s consider a simple linear regression problem where we try to fit a line to a set of points. The equation of a line is given by y = mx + b, where m is the slope, and b is the y-intercept. Our goal is to find the values of m and b that minimize the mean squared error between our line and the data points.

Implementing with NumPy

NumPy is an excellent choice for implementing algorithms like Gradient Descent due to its efficient array operations. The following sections demonstrate the implementation step-by-step.

Step 1: Initialization

import numpy as np

# Example data
X = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])

# Initializing parameters m (slope) and b (y-intercept) to zero
m, b = 0, 0
learning_rate = 0.01
iterations = 1000

Step 2: Computing the Loss

We will use the mean squared error (MSE) as our loss function. It is calculated as the average of the squared differences between the actual values and the values predicted by the line:

MSE = (1/N) * sum((y_i - (m * x_i + b))**2)

def compute_loss(X, y, m, b):
    return np.sum((y - (m*X + b))**2) / len(y)

initial_loss = compute_loss(X, y, m, b)
print(f'Initial loss: {initial_loss}')

Step 3: Updating Parameters

The critical part of Gradient Descent is updating the parameters m and b. To do this, we need to compute the partial derivatives of the loss function with respect to m and b.
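
Taking these derivatives of the MSE gives:

dL/dm = -(2/N) * sum(x_i * (y_i - (m * x_i + b)))
dL/db = -(2/N) * sum(y_i - (m * x_i + b))

These are exactly the expressions computed in the function below.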


def update_params(X, y, m, b, learning_rate):
    N = len(y)
    m_gradient = -(2/N) * np.sum(X * (y - (m * X + b)))
    b_gradient = -(2/N) * np.sum(y - (m * X + b))
    
    m_updated = m - learning_rate * m_gradient
    b_updated = b - learning_rate * b_gradient
    return m_updated, b_updated
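
As a quick sanity check using the data from Step 1, a single update starting from m = b = 0 moves both parameters in the right direction and lowers the loss:

# By hand: m_gradient = -(2/5) * (1*5 + 2*7 + 3*9 + 4*11 + 5*13) = -62.0
#          b_gradient = -(2/5) * (5 + 7 + 9 + 11 + 13) = -18.0
m1, b1 = update_params(X, y, 0, 0, 0.01)
print(m1, b1)                      # 0.62, 0.18 (up to float rounding)
print(compute_loss(X, y, m1, b1))  # about 52.25, down from the initial 89.0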

Step 4: The Gradient Descent Loop

Now, we’ll put everything together and perform the iterative process of updating our parameters.


for i in range(iterations):
    m, b = update_params(X, y, m, b, learning_rate)
    if (i + 1) % 100 == 0:
        print(f'After {i+1} iterations, loss: {compute_loss(X, y, m, b)}')

print(f'Final parameters: m={m}, b={b}')

Running the code will show the loss decreasing as the number of iterations increases, demonstrating that our parameters are converging to values that minimize the loss function. Since the example data lie exactly on the line y = 2x + 3, m and b should approach 2 and 3, respectively.

Advanced Topics

With the basics covered, we can move on to advanced topics like vectorizing the computation for higher efficiency, using advanced optimization techniques such as momentum or implementing stochastic or mini-batch gradient descent. In practical scenarios, these enhancements are crucial for handling larger datasets and more complex models.
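
For instance, here is a minimal sketch of gradient descent with momentum on the slope/intercept problem from earlier (it assumes X and y from Step 1; the velocity variables v_m, v_b and the coefficient beta are illustrative names, not part of the code above). Momentum keeps an exponentially decaying average of past gradients and steps along that average, which damps oscillations:

# Gradient descent with momentum (sketch)
m, b = 0, 0
v_m, v_b = 0, 0
beta = 0.9            # momentum coefficient
learning_rate = 0.01

for i in range(1000):
    N = len(y)
    error = y - (m * X + b)
    m_gradient = -(2/N) * np.sum(X * error)
    b_gradient = -(2/N) * np.sum(error)

    # Update the velocities, then step along them
    v_m = beta * v_m + (1 - beta) * m_gradient
    v_b = beta * v_b + (1 - beta) * b_gradient
    m -= learning_rate * v_m
    b -= learning_rate * v_b

print(f'With momentum: m={m}, b={b}')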

Let’s create an example where we use NumPy to implement a vectorized version of mini-batch gradient descent, an advanced optimization technique often used in machine learning. This method is more efficient for handling larger datasets and complex models.

The following code demonstrates a simple linear regression problem where we use mini-batch gradient descent for optimization:

import numpy as np

# Generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(1000, 1)
y = 4 + 3 * X + np.random.randn(1000, 1)

# Add a bias term (intercept term) to X
X_b = np.c_[np.ones((1000, 1)), X]

# Parameters for the gradient descent
learning_rate = 0.01
n_iterations = 50
batch_size = 20
m = len(X_b)  # number of training samples

# Initialize theta (parameters) randomly
theta = np.random.randn(2, 1)

# Mini-batch gradient descent
for iteration in range(n_iterations):
    for i in range(0, m, batch_size):
        xi = X_b[i:i+batch_size]
        yi = y[i:i+batch_size]
        
        gradients = 2/batch_size * xi.T.dot(xi.dot(theta) - yi)
        theta -= learning_rate * gradients

# Print the final parameters
print(f"Theta: \n{theta}")

# Optionally, plot the data and the model
import matplotlib.pyplot as plt

plt.plot(X, y, "b.")
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta)
plt.plot(X_new, y_predict, "r-")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.show()

Here:

  • Data Generation: We start by generating synthetic data for a simple linear regression problem. This dataset X and y represents our features and labels, respectively.
  • Mini-batch Gradient Descent Setup: We set our learning rate and number of iterations for the gradient descent. We also define the batch size for the mini-batch gradient descent. The theta vector is initialized randomly.
  • Gradient Descent Loop: The main loop iterates over the number of iterations. Inside, a nested loop extracts mini-batches from the data in a fixed order and computes the gradient for each batch. The gradient is used to update the parameters (theta). A common refinement, reshuffling the data each pass, is sketched after this list.
  • Vectorization: The use of NumPy’s dot product and broadcasting features allows for efficient computation without the need for explicit for-loops over the dataset or the features, significantly speeding up the computations.
  • Plotting: After the gradient descent, we plot the original data and our linear model’s predictions. This visualizes how well our model has fit the data.
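
One refinement worth noting: the loop above visits the mini-batches in the same fixed order on every pass. Stochastic and mini-batch gradient descent typically reshuffle the training data at the start of each pass; a minimal sketch of that change, using np.random.permutation, is:

for iteration in range(n_iterations):
    # Reshuffle the rows once per pass over the data
    shuffled = np.random.permutation(m)
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for i in range(0, m, batch_size):
        xi = X_shuffled[i:i+batch_size]
        yi = y_shuffled[i:i+batch_size]
        gradients = 2/batch_size * xi.T.dot(xi.dot(theta) - yi)
        theta -= learning_rate * gradients

Because the data were generated as y = 4 + 3x plus noise, theta should land close to (4, 3) either way.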

Conclusion

Gradient Descent is a simple yet powerful optimization technique that can be used across a wide range of problems in machine learning and optimization. Using NumPy to implement Gradient Descent harnesses the power of this library for efficient scientific computation. Following the steps outlined in this guide, practitioners can build upon this foundation for more complicated models and algorithms.