How to Use NumPy for Linear Regression

Updated: January 23, 2024 By: Guest Contributor

Introduction

Linear Regression is a fundamental algorithm in machine learning and statistics used to model the relationship between one or more independent (predictor) variables and a dependent (target) variable, and to predict new values of the target. In Python, the NumPy library is a powerful tool for numerical computation that we can leverage to perform linear regression without relying on higher-level libraries such as scikit-learn. This tutorial will guide you through the steps to implement linear regression using NumPy, from basic to more advanced examples. We’ll start with simple linear regression and gradually move on to multiple linear regression, with plenty of code examples to solidify your understanding.

Setting Up Your Environment

Before diving into linear regression, make sure to set up your Python environment and install NumPy. If you haven’t installed NumPy yet, you can do so using pip:

pip install numpy

Once NumPy is installed, you can import it into your Python script or Jupyter notebook:

import numpy as np

Simple Linear Regression

Simple linear regression involves a single independent variable and a dependent variable. The goal is to find the linear relationship represented by the equation y = mx + b, where m is the slope, and b is the y-intercept.

Generating Sample Data

x = np.array([1, 2, 3, 4, 5]) 
y = np.array([2, 4, 5, 4, 5])

Calculating Slope and Intercept

x_mean = x.mean()
y_mean = y.mean()
# Unnormalized covariance and variance; the 1/n factors cancel in the ratio
covariance = (x - x_mean).dot(y - y_mean)
variance = (x - x_mean).dot(x - x_mean)
slope = covariance / variance
intercept = y_mean - slope * x_mean
print(f'Slope: {slope}, Intercept: {intercept}')
# Output: Slope: 0.6, Intercept: 2.2

In the example above, we calculated the slope and intercept manually using NumPy functions for mean and dot product. We can now use these values to predict future data points or understand our data’s trend.
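
With the fitted line in hand, making predictions is just a matter of plugging new x values into y = mx + b. The sketch below uses a few illustrative x values and also cross-checks our manual result against NumPy's built-in np.polyfit with degree 1:

x_new = np.array([6, 7, 8])  # illustrative new inputs
y_new = slope * x_new + intercept
print(y_new)  # approximately [5.8 6.4 7.0]

fit_slope, fit_intercept = np.polyfit(x, y, 1)  # degree-1 polynomial fit
print(fit_slope, fit_intercept)  # approximately 0.6 and 2.2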

Multiple Linear Regression

Multiple linear regression refers to a scenario where you have more than one independent variable. The equation is now y = b0 + m1x1 + m2x2 + ... + mnxn, where b0 is the intercept, and m1, m2, ..., mn are the coefficients for each independent variable.

Building a Feature Matrix

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3], [3, 5]]) 
y = np.array([5, 7, 9, 11, 16])

Calculating Coefficients and Intercept Using the Normal Equation

X_b = np.c_[np.ones((5, 1)), X]  # Adding x0 = 1 to each instance
beta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(f'Coefficients: {beta[1:]}, Intercept: {beta[0]}')
# Output (approximately): Coefficients: [2. 1.73913043], Intercept: 1.4783

In this example, we form the feature matrix X and the target values y, then solve the normal equation, beta = (X^T X)^-1 X^T y, to obtain the coefficients and intercept of our linear regression model. By employing NumPy’s linear algebra module, we can compute these values quite succinctly.
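
Explicitly inverting X^T X is fine for small, well-conditioned problems, but NumPy also offers np.linalg.lstsq, which solves the same least-squares problem more robustly. A minimal sketch, reusing the X_b and y defined above:

beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)
print(f'Coefficients: {beta_lstsq[1:]}, Intercept: {beta_lstsq[0]}')
# Should match the normal equation result up to floating-point error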

Regularization with Ridge Regression

When you deal with multiple features, sometimes your model can suffer from multicollinearity or overfitting. One way to mitigate this is by using regularization techniques like Ridge Regression, which adds an L2 penalty, lambda * ||beta||^2, to the least-squares objective; the closed-form solution becomes beta = (X^T X + lambda * I)^-1 X^T y.

Applying Ridge Regression (L2 Regularization)

lambda_param = 1
identity_size = X_b.shape[1]
ridge = np.linalg.inv(X_b.T.dot(X_b) + lambda_param * np.identity(identity_size)).dot(X_b.T).dot(y)
print(f'Ridge Coefficients: {ridge[1:]}, Ridge Intercept: {ridge[0]}')
# Output (approximately): Ridge Coefficients: [1.70652174 1.98913043], Ridge Intercept: 1.1304

This example demonstrates how to apply ridge regression to regularize our model. We introduce a penalty to the size of the coefficients based on a lambda parameter, which helps reduce overfitting.
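
Note that the identity matrix above also penalizes the intercept term (the column of ones added to X_b). In practice the intercept is usually left unregularized; a minimal sketch of that variant, reusing X_b, y, and lambda_param from above:

penalty = lambda_param * np.identity(identity_size)
penalty[0, 0] = 0  # exclude the bias/intercept term from the L2 penalty
ridge_no_bias = np.linalg.inv(X_b.T.dot(X_b) + penalty).dot(X_b.T).dot(y)
print(f'Ridge Coefficients: {ridge_no_bias[1:]}, Ridge Intercept: {ridge_no_bias[0]}')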

Advanced: Implementing Gradient Descent

For large datasets, using the normal equation may not be computationally efficient, since it requires forming and inverting the matrix X^T X. An alternative method for fitting a linear regression model is Gradient Descent, an optimization algorithm that iteratively moves towards the minimum of a cost function, here the Mean Squared Error (MSE).

Gradient Descent for Linear Regression

# Reuse the data from the simple linear regression example
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

learning_rate = 0.01
iterations = 1000
m, b = 0, 0  # initial guesses for slope and intercept

for i in range(iterations):
    y_pred = m * x + b
    grad_m = -2 / x.size * x.dot(y - y_pred)   # dMSE/dm
    grad_b = -2 / x.size * (y - y_pred).sum()  # dMSE/db
    m, b = m - learning_rate * grad_m, b - learning_rate * grad_b

print(f'Slope after gradient descent: {m}, Intercept after gradient descent: {b}')
# Output (approximately): Slope: 0.62, Intercept: 2.14

This block of code demonstrates a simple version of Gradient Descent for linear regression with one variable. On each iteration we make an incremental adjustment to the slope (m) and intercept (b), moving in the direction that reduces our cost function, which, in this case, is the Mean Squared Error. With more iterations, the estimates converge to the closed-form solution found earlier (slope 0.6, intercept 2.2).
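
The same idea extends to multiple features by working with the full feature matrix. Below is a minimal sketch of batch gradient descent in matrix form on the multiple-regression data from earlier; the learning rate and iteration count are illustrative choices, not tuned values:

# Multiple-feature data from the earlier example
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3], [3, 5]])
y = np.array([5, 7, 9, 11, 16])
X_b = np.c_[np.ones((5, 1)), X]   # add the bias column

theta = np.zeros(X_b.shape[1])    # [intercept, coefficient 1, coefficient 2]
lr = 0.01                         # illustrative learning rate
for _ in range(20000):            # illustrative iteration count
    gradients = 2 / len(y) * X_b.T.dot(X_b.dot(theta) - y)  # gradient of the MSE
    theta -= lr * gradients
print(f'Coefficients: {theta[1:]}, Intercept: {theta[0]}')
# Should approach the normal equation solution as the number of iterations grows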

Conclusion

In this tutorial, we have explored different ways to implement linear regression using NumPy. From simple to multiple linear regression, we covered the basics as well as regularization with Ridge and the implementation of Gradient Descent for large datasets. Understanding these fundamental operations helps build intuition for more complex machine learning tasks and sets the foundation for further exploration with more sophisticated tools.