Principal Component Analysis, or PCA, is a statistical technique used in machine learning and data science to reduce the dimensionality of a data set while preserving as much of its variability as possible. It transforms the data into a new coordinate system whose axes are ordered by how much of the data’s variance they capture, so the most informative directions come first. This tutorial guides you through PCA with the help of Python’s NumPy library.

Understanding the Basics of PCA

Before we get hands-on with NumPy, it’s essential to understand what PCA does. PCA identifies the axes (principal components) that maximize the variance in the data set. It then projects the original data onto these new axes. This process often helps simplify the data, improve algorithm performance, or identify key features in the data.
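In symbols, the idea above can be sketched as follows. The notation is an assumption introduced here for illustration (it does not come from the original text): X is the n-by-d data matrix with centered columns, C is its sample covariance matrix, v_i and λ_i are the eigenvectors and eigenvalues of C, and Z is the data projected onto the first k components.

```latex
% X: n-by-d data matrix with centered (and optionally scaled) columns
C = \frac{1}{n-1} X^{\top} X, \qquad
C v_i = \lambda_i v_i, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0, \qquad
Z = X V_k, \quad V_k = [\, v_1 \; \cdots \; v_k \,]
```

Each eigenvalue λ_i equals the variance of the data along its eigenvector v_i, which is why sorting by eigenvalue orders the components by importance.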

Getting Started

To perform PCA, we need to install NumPy, which is a powerful library for numerical computations in Python. If you haven’t already, you can install NumPy using pip:

pip install numpy

Sample Data Preparation

For this tutorial, we’ll work with a small, hard-coded data set of ten samples with two features each:

import numpy as np

data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0],
                 [2.3, 2.7],
                 [2, 1.6],
                 [1, 1.1],
                 [1.5, 1.6],
                 [1.1, 0.9]])

Step 1: Standardize the Data

PCA is affected by scale, so you need to scale the features in the data before applying PCA:

mean = np.mean(data, axis=0)
std_dev = np.std(data, axis=0)
data_std = (data - mean) / std_dev
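A quick sanity check can confirm that the standardization did what we expect. The snippet below is a minimal sketch that repeats the step on the same sample data and verifies that each column ends up with mean 0 and standard deviation 1 (up to floating-point error).

```python
import numpy as np

# Same sample data as in the tutorial
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

mean = np.mean(data, axis=0)
std_dev = np.std(data, axis=0)      # population std (ddof=0), NumPy's default
data_std = (data - mean) / std_dev

# After standardizing, every column should have mean ~0 and std ~1
print(np.allclose(data_std.mean(axis=0), 0.0))
print(np.allclose(data_std.std(axis=0), 1.0))
```

Note that np.std divides by n by default, while np.cov (used in the next step) divides by n - 1, so the diagonal of the covariance matrix will be n / (n - 1) rather than exactly 1.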

Step 2: Compute the Covariance Matrix

The next step is to compute the covariance matrix of the standardized data:

cov_matrix = np.cov(data_std.T)
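To see what np.cov computes here, the sketch below rebuilds the covariance matrix by hand and compares the two. It assumes the data is already centered (true after standardization); the name cov_manual is introduced only for this illustration.

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

n_samples = data_std.shape[0]

# data_std is centered, so the sample covariance is X^T X / (n - 1)
cov_manual = data_std.T @ data_std / (n_samples - 1)

# np.cov treats each ROW as a variable by default, hence the transpose
cov_matrix = np.cov(data_std.T)

print(np.allclose(cov_manual, cov_matrix))
print(cov_matrix.shape)
```

The transpose matters: passing data_std without it would treat the ten samples as ten variables and produce a 10-by-10 matrix.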

Step 3: Calculate Eigenvalues and Eigenvectors

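Compute the eigenvalues and eigenvectors of the covariance matrix to identify the principal components. Because a covariance matrix is symmetric, np.linalg.eigh is the appropriate routine: it returns real eigenvalues and orthonormal eigenvectors (one per column):

values, vectors = np.linalg.eigh(cov_matrix)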

Sort the eigenvalues in descending order and reorder the eigenvector columns to match:

sorted_indices = np.argsort(values)[::-1]
values_sorted = values[sorted_indices]
vectors_sorted = vectors[:,sorted_indices]
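The sorted eigenvalues also tell you how much variance each component explains. The following is a minimal sketch, not part of the original steps; the names explained_ratio and cumulative are introduced here for illustration.

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
data_std = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(data_std.T)

# eigh is NumPy's eigen-solver for symmetric matrices
values, vectors = np.linalg.eigh(cov_matrix)
order = np.argsort(values)[::-1]          # indices for descending eigenvalues
values_sorted = values[order]

# Fraction of the total variance captured by each component
explained_ratio = values_sorted / values_sorted.sum()
cumulative = np.cumsum(explained_ratio)   # running total, ends at 1.0

print(explained_ratio)
print(cumulative)
```

A common rule of thumb is to keep the smallest number of components whose cumulative ratio passes a threshold you choose for your problem.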

Step 4: Project the Data onto Principal Components

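Now it’s time to project the standardized data onto the principal components. The sample data has only two features, so keeping two components rotates the data into the new coordinate system without discarding anything; to actually reduce the dimensionality, keep fewer columns of vectors_sorted: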

pca_components = vectors_sorted[:, :2]
projected_data = np.dot(data_std, pca_components)
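To illustrate a genuine reduction, the sketch below keeps only the first principal component and then maps the result back to two dimensions to measure what was lost. The names top_component, projected_1d, reconstructed, and mse are assumptions made for this example.

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

values, vectors = np.linalg.eigh(np.cov(data_std.T))
order = np.argsort(values)[::-1]
values_sorted = values[order]
vectors_sorted = vectors[:, order]

# Keep only the first principal component: 2-D data becomes 1-D
top_component = vectors_sorted[:, :1]           # shape (2, 1)
projected_1d = data_std @ top_component         # shape (10, 1)

# Map back to the standardized 2-D space; this is lossy by design
reconstructed = projected_1d @ top_component.T  # shape (10, 2)
mse = np.mean((data_std - reconstructed) ** 2)

print(projected_1d.shape)
print(mse)
```

The reconstruction error comes entirely from the discarded component, so it is small whenever the dropped eigenvalues are small.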

Advanced PCA with Sigmoid Normalization

In certain cases, you might want to apply a non-linear normalization such as the sigmoid function before applying PCA:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Apply the sigmoid normalization to the standardized data:

data_sig = sigmoid(data_std)

Repeat steps 2 to 4 with data_sig as the input to observe the differences.
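One way to make that comparison easier is to wrap steps 2 to 4 in a function. The helper below, pca_project, is hypothetical and introduced only for this sketch. It also re-centers its input before projecting, which is an assumption added here because the sigmoid output is no longer centered at zero.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def pca_project(x, n_components):
    """Hypothetical helper: covariance, eigendecomposition, projection."""
    x_centered = x - x.mean(axis=0)               # assumption: re-center input
    cov = np.cov(x_centered.T)
    values, vectors = np.linalg.eigh(cov)         # symmetric-matrix solver
    order = np.argsort(values)[::-1]              # descending eigenvalues
    components = vectors[:, order[:n_components]]
    return x_centered @ components, values[order]

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
data_std = (data - data.mean(axis=0)) / data.std(axis=0)
data_sig = sigmoid(data_std)

proj_std, values_std = pca_project(data_std, 2)
proj_sig, values_sig = pca_project(data_sig, 2)

print(values_std)   # eigenvalues for the standardized data
print(values_sig)   # eigenvalues after the sigmoid squashes values into (0, 1)
```

Because the sigmoid compresses every value into the interval (0, 1), the total variance (the sum of the eigenvalues) is much smaller for data_sig than for data_std.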

Visualizing PCA Results

It’s often helpful to visualize the results of PCA to understand the transformation. Here’s how we might use matplotlib to plot the projected data:

import matplotlib.pyplot as plt

plt.scatter(projected_data[:, 0], projected_data[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Result')
plt.grid(True)
plt.show()

Applying PCA to Real-World Datasets

To extend what you’ve learned, you can perform PCA on a real-world dataset, such as the Iris dataset. Libraries like scikit-learn offer PCA out-of-the-box, but doing it manually with NumPy provides a more in-depth understanding.

Load the Iris dataset, repeat the steps above, and interpret the result. Is there a clear separation between different species of Iris flowers when you project the data onto the first two principal components?

Combining PCA with Machine Learning Models

PCA is commonly used as a preprocessing step in machine learning. By reducing the dimensionality of the feature set, it can shorten training time and, in some cases, improve a model’s accuracy by discarding noisy or redundant directions. Experiment with incorporating PCA into your next machine learning workflow and observe the differences.
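When PCA feeds a model, the mean, standard deviation, and components should be learned from the training data only and then reused on new data. The sketch below shows that pattern with NumPy alone. The data is randomly generated for illustration, and the 150/50 split and the choice of two components are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed; synthetic data only
latent = rng.normal(size=(200, 2))             # two hidden factors
mixing = rng.normal(size=(2, 5))               # spread them over five features
features = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

train, test = features[:150], features[150:]

# Fit the PCA parameters on the training split only
mean = train.mean(axis=0)
std_dev = train.std(axis=0)
train_std = (train - mean) / std_dev
values, vectors = np.linalg.eigh(np.cov(train_std.T))
order = np.argsort(values)[::-1]
components = vectors[:, order[:2]]             # keep two components

# Apply the SAME mean, std, and components to both splits
train_reduced = train_std @ components
test_reduced = ((test - mean) / std_dev) @ components

print(train_reduced.shape, test_reduced.shape)
```

The reduced arrays can then be passed to whatever model you are training. Recomputing the statistics on the test split would leak information and make the two splits incomparable.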

Conclusion

Throughout this tutorial, you’ve learned how to perform PCA using NumPy from basic methods to more advanced techniques. You’ve also explored how to visualize and apply PCA to real-world data. As dimensionality reduction is a powerful technique, mastering PCA can significantly enhance your data analysis skills.

