How to Use NumPy for Data Normalization and Preprocessing

Updated: January 23, 2024 By: Guest Contributor

Introduction

Data normalization is a critical step in data preprocessing, especially in machine learning. Normalization refers to scaling numeric features to a common range without distorting the relative differences between values. NumPy is a fundamental package for scientific computing in Python and provides a flexible platform for working with numeric data. In this tutorial, we’ll go through how to use NumPy to perform data normalization and preprocessing.

Understanding NumPy Arrays

Before diving into normalization, let’s review the basic building block of NumPy – the array. NumPy arrays are grid-like structures that hold elements of the same data type. They are powerful because operations on them are vectorized: an operation is applied to the whole array at once rather than element by element in a Python loop, which greatly speeds up computation.

import numpy as np

# Creating a simple NumPy array
array = np.array([1, 2, 3, 4, 5])
print(array)

Output:

[1 2 3 4 5]
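
As a quick illustration of that vectorization, the snippet below (continuing from the array above) applies arithmetic to every element at once, with no explicit Python loop:

# Element-wise operations on the whole array
print(array * 2)   # [ 2  4  6  8 10]
print(array + 10)  # [11 12 13 14 15]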

Scaling and Normalization Techniques

Normalization usually involves scaling the features in your data to a common range. Common choices include the [0, 1] range (min-max scaling) and the standard score (Z-score).

Min-Max Scaling

This technique re-scales features to a fixed range of [0, 1].

def min_max_scaling(array):
    min_val = np.min(array)
    max_val = np.max(array)
    scaled_array = (array - min_val) / (max_val - min_val)
    return scaled_array

# Example:
original_data = np.array([10, 20, 30, 40, 50])
scaled_data = min_max_scaling(original_data)
print(scaled_data)

Output:

[0.   0.25 0.5  0.75 1.  ]
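
One caveat: if every value in the feature is identical, max_val - min_val is zero and the division produces NaNs along with a runtime warning. A minimal guard, assuming you simply want a constant feature mapped to zeros, could look like this:

def safe_min_max_scaling(array):
    min_val = np.min(array)
    max_val = np.max(array)
    value_range = max_val - min_val
    if value_range == 0:
        # Constant feature: return zeros instead of dividing by zero
        return np.zeros_like(array, dtype=float)
    return (array - min_val) / value_range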

Standardization (Z-Score Normalization)

Another common preprocessing technique is Z-score normalization (standardization), where values are rescaled to have a mean of μ = 0 and a standard deviation of σ = 1. Note that this standardizes the scale of the data; it does not make the data normally distributed.

def z_score_normalization(array):
    mean = np.mean(array)
    std_dev = np.std(array)
    normalized_array = (array - mean) / std_dev
    return normalized_array

# Example:
original_data = np.array([10, 20, 30, 40, 50])
norm_data = z_score_normalization(original_data)
print(norm_data)

Output:

[-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
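
One detail to keep in mind: np.std computes the population standard deviation by default (ddof=0). If you want the sample standard deviation instead, pass ddof=1; the standardized values then shrink slightly in magnitude:

# Using the sample standard deviation (ddof=1) instead of the default
sample_std = np.std(original_data, ddof=1)
print((original_data - np.mean(original_data)) / sample_std)
# approx. [-1.265 -0.632  0.     0.632  1.265]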

Processing Data Matrices

In practice, you’ll often deal with matrices rather than individual vectors. NumPy makes it easy to apply these normalization techniques across entire matrices.

Column-Wise Min-Max Scaling

def matrix_min_max_scaling(matrix):
    min_vals = np.min(matrix, axis=0)
    max_vals = np.max(matrix, axis=0)
    scaled_matrix = (matrix - min_vals) / (max_vals - min_vals)
    return scaled_matrix

# Example:
matrix = np.array([[1, 400], [2, 300], [3, 200], [4, 100]])
scaled_matrix = matrix_min_max_scaling(matrix)
print(scaled_matrix)

Output:

[[0.         1.        ]
 [0.33333333 0.66666667]
 [0.66666667 0.33333333]
 [1.         0.        ]]

Column-Wise Z-Score Normalization

def matrix_z_score_normalization(matrix):
    means = np.mean(matrix, axis=0)
    std_devs = np.std(matrix, axis=0)
    normalized_matrix = (matrix - means) / std_devs
    return normalized_matrix

# Example:
matrix = np.array([[1, 400], [2, 300], [3, 200], [4, 100]])
norm_matrix = matrix_z_score_normalization(matrix)
print(norm_matrix)

Output:

[[-1.34164079  1.34164079]
 [-0.4472136   0.4472136 ]
 [ 0.4472136  -0.4472136 ]
 [ 1.34164079 -1.34164079]]
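
In a machine learning workflow, the normalization statistics are usually computed on the training data only and then reused to transform new data. A minimal sketch (the train and test arrays here are illustrative):

# Fit the statistics on the training data only
train = np.array([[1, 400], [2, 300], [3, 200], [4, 100]])
test = np.array([[2, 250], [5, 150]])

means = np.mean(train, axis=0)
std_devs = np.std(train, axis=0)

# Apply the same statistics to both sets
train_norm = (train - means) / std_devs
test_norm = (test - means) / std_devs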

Advanced Normalization Techniques

While min-max scaling and Z-score normalization are the most common, many other techniques can be handy depending on the data.

L2 Normalization

Also known as Euclidean normalization, this technique scales the input array so that the Euclidean length (L2 norm) is 1.

def l2_normalization(array):
    l2_norm = np.linalg.norm(array)
    normalized_array = array / l2_norm
    return normalized_array

# Example:
original_data = np.array([1, 2, 3])
l2_norm_data = l2_normalization(original_data)
print(l2_norm_data)

Output:

[0.26726124 0.53452248 0.80178373]
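
The same idea extends to matrices: np.linalg.norm accepts an axis argument, so each row (for example, each sample) can be scaled to unit length independently. A short sketch:

matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# L2 norm of each row; keepdims=True lets broadcasting divide row by row
row_norms = np.linalg.norm(matrix, axis=1, keepdims=True)
unit_rows = matrix / row_norms
print(unit_rows)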

Max Normalization

Another approach is to scale a feature by dividing each of its values by the feature’s maximum value.

def max_normalization(array):
    max_val = np.max(array)
    normalized_array = array / max_val
    return normalized_array

# Example:
original_data = np.array([2, 4, 6, 8, 10])
max_norm_data = max_normalization(original_data)
print(max_norm_data)

Output:

[0.2 0.4 0.6 0.8 1. ]
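
Note that plain max normalization assumes non-negative values; if a feature contains negative values, dividing by the largest absolute value instead keeps the result within [-1, 1]:

def max_abs_normalization(array):
    # Divide by the largest absolute value so results stay in [-1, 1]
    return array / np.max(np.abs(array))

print(max_abs_normalization(np.array([-4, -2, 0, 2, 8])))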

Handling Outliers

When normalizing data, it’s essential to consider the presence of outliers. Extreme values can dominate the minimum, maximum, mean, or standard deviation, leading to misrepresentative scaling of the remaining points. Robust scaling addresses this by centering on the median and scaling by the interquartile range, both of which are largely insensitive to outliers.

Both statistics are easy to compute with NumPy, although in practice robust scaling is usually handled by higher-level libraries such as scikit-learn (RobustScaler).
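
For completeness, here is a minimal sketch of that idea in plain NumPy, centering each column on its median and scaling by its interquartile range (this mirrors, but does not replace, scikit-learn’s RobustScaler):

def robust_scaling(matrix):
    # Center on the median and scale by the interquartile range (IQR)
    medians = np.median(matrix, axis=0)
    q1 = np.percentile(matrix, 25, axis=0)
    q3 = np.percentile(matrix, 75, axis=0)
    return (matrix - medians) / (q3 - q1)

# Example with an outlier in the second column:
data = np.array([[1, 400], [2, 300], [3, 200], [4, 10000]])
print(robust_scaling(data))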

Conclusion

In this tutorial, we covered several techniques for using NumPy to normalize and preprocess data, highlighted the importance of correctly scaling your features, and provided code examples for each. Proper preprocessing can have a significant impact on the performance of your machine learning models.