How to Perform Advanced Multivariate Analysis with NumPy

Updated: January 23, 2024 By: Guest Contributor

Introduction

Multivariate analysis is a fundamental technique in data science that involves the observation and analysis of more than one statistical outcome variable at a time. Data scientists use it to understand patterns and relationships between multiple variables. Python’s NumPy library is a powerful tool that makes it easy to perform complex numerical computations with efficiency. In this tutorial, we’ll take a deep dive into various advanced multivariate analysis techniques using NumPy.

Preparation

Before we begin, ensure that you have NumPy installed in your Python environment. You can install it using pip:

pip install numpy

Basic Operations with NumPy

First, let’s get familiar with some basic operations in NumPy needed for multivariate analysis.

import numpy as np

# Creating arrays
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[9, 8, 7], [6, 5, 4], [3, 2, 1]])

# Element-wise addition
result = a + b
print(result)

The output will be:

[[10 10 10]
 [10 10 10]
 [10 10 10]]
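Most of the techniques that follow rely on three further operations: the matrix product, column-wise statistics, and broadcasting. The sketch below reuses the arrays a and b from above; the names product, col_means, and centered are illustrative, and it assumes rows are observations and columns are variables.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[9, 8, 7], [6, 5, 4], [3, 2, 1]])

# Matrix product (rows of a times columns of b), not element-wise multiplication
product = a @ b

# Column-wise mean: one value per variable when rows are observations
col_means = a.mean(axis=0)

# Broadcasting subtracts each column's mean from every row of that column
centered = a - col_means

print(product)
print(col_means)
print(centered)
```

Centering by the column means is the first step of the covariance, PCA, and CCA computations later in this tutorial.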

Covariance Matrix

To understand the relationship between multiple variables, calculating the covariance matrix is a standard approach. Below is an example of computing the covariance:

import numpy as np

# Creating a sample dataset
X = np.random.rand(100, 3)  # 100 observations with 3 features

# Calculating the covariance matrix
cov_matrix = np.cov(X.T)  # np.cov treats rows as variables by default, so transpose
print(cov_matrix)

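The output is a symmetric 3×3 matrix: the diagonal holds the variance of each feature, and the off-diagonal entries hold the covariance between each pair of features. Because X is random, the exact values differ on every run.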
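To see what np.cov computes, the same matrix can be built by hand from the centered data. This is a minimal sketch: the seed is arbitrary (it only makes the run repeatable), and the names cov_manual and corr_manual are illustrative. It also shows how the covariance matrix is rescaled into a correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
X = rng.random((100, 3))        # 100 observations with 3 features

# Sample covariance by hand: center, then X_c^T X_c / (n - 1)
X_centered = X - X.mean(axis=0)
n = X.shape[0]
cov_manual = X_centered.T @ X_centered / (n - 1)

# rowvar=False tells np.cov that columns are the variables (no transpose needed)
cov_numpy = np.cov(X, rowvar=False)
print(np.allclose(cov_manual, cov_numpy))

# Correlation matrix: divide each covariance by the product of standard deviations
std = np.sqrt(np.diag(cov_numpy))
corr_manual = cov_numpy / np.outer(std, std)
print(np.allclose(corr_manual, np.corrcoef(X, rowvar=False)))
```

Passing rowvar=False is equivalent to transposing X first, and it reads more naturally when observations are stored in rows.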

Principal Component Analysis (PCA)

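PCA finds orthogonal directions (the principal components) along which the data varies the most, which makes it useful for dimensionality reduction. The example below reuses the X array and the np import from the covariance section: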

from numpy import linalg as LA

# Subtract the mean from each feature
X_meaned = X - np.mean(X , axis = 0)

# Calculating the covariance matrix of the mean-centered data
cov_mat = np.cov(X_meaned , rowvar = False)

# Eigen decomposition
eigen_values , eigen_vectors = LA.eigh(cov_mat)

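This calculation gives the eigenvalues and eigenvectors of the covariance matrix. Each eigenvector (a column of eigen_vectors) is a principal direction, and its eigenvalue is the variance of the data along that direction. Note that eigh returns the eigenvalues in ascending order, so the first principal component is the last column.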
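The decomposition above stops at the eigenpairs. A minimal sketch of the remaining PCA steps follows: sorting the components by variance, computing the share of variance each one explains, and projecting the data. The seed, n_components = 2, and the names order, explained_variance_ratio, and X_reduced are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
X = rng.random((100, 3))

# Center the data and decompose its covariance matrix
X_meaned = X - X.mean(axis=0)
cov_mat = np.cov(X_meaned, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)

# eigh returns ascending eigenvalues; reorder so the largest comes first
order = np.argsort(eigen_values)[::-1]
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]

# Fraction of the total variance captured by each component
explained_variance_ratio = eigen_values / eigen_values.sum()

# Project the centered data onto the leading components
n_components = 2
X_reduced = X_meaned @ eigen_vectors[:, :n_components]

print(explained_variance_ratio)
print(X_reduced.shape)
```

The projected columns are uncorrelated, and their variances equal the leading eigenvalues, which is a convenient check that the projection is correct.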

Multiple Linear Regression

NumPy can also fit a multiple linear regression by solving the normal equations, (XᵀX)β = Xᵀy:

# Sample data
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])  # Features
y = np.dot(X, np.array([1, 2])) + 3   # Dependent variable

# Adding a bias column to the features
X = np.hstack((np.ones((X.shape[0], 1)), X))

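# Calculating the coefficients from the normal equations (X^T X) b = X^T y;
# solve() avoids forming an explicit inverse, which is less accurate
coefficients = np.linalg.solve(X.T @ X, X.T @ y)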
print(coefficients)

The result is approximately [3, 1, 2]: the intercept followed by the two slopes, which matches the values used to generate y.
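The normal equations become numerically fragile when features are nearly collinear. A commonly used alternative is np.linalg.lstsq, which solves the least-squares problem directly. The sketch below uses the same sample data; the names X_design and r_squared are illustrative, and the R² calculation is an added check rather than part of the original example.

```python
import numpy as np

# Same sample data as above: y = 1*x1 + 2*x2 + 3
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=float)
y = X @ np.array([1.0, 2.0]) + 3.0

# Design matrix with a leading column of ones for the intercept
X_design = np.column_stack((np.ones(X.shape[0]), X))

# Least-squares solution; rcond=None selects the current default cutoff
coefficients, residuals, rank, singular_values = np.linalg.lstsq(X_design, y, rcond=None)

# Coefficient of determination (R^2) as a goodness-of-fit check
y_pred = X_design @ coefficients
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(coefficients)
print(r_squared)
```

The returned rank is worth inspecting: a value smaller than the number of columns in the design matrix signals collinear features, in which case the coefficients are not uniquely determined.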

Canonical Correlation Analysis (CCA)

CCA analyzes the relationship between two sets of variables by finding linear combinations, one from each set, that are maximally correlated with each other. The example below uses NumPy for the covariance algebra and SciPy for the singular value decomposition, with comments explaining each step.

First you need to install scipy:

pip install scipy

The code:

import numpy as np
from scipy.linalg import svd

# Sample data: Two sets of variables X and Y
X = np.array([[1, 2], [3, 4], [5, 6]])
Y = np.array([[6, 5], [4, 3], [2, 1]])

# Center the data
X_centered = X - np.mean(X, axis=0)
Y_centered = Y - np.mean(Y, axis=0)

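# Regularization parameter: the columns within X (and within Y) are perfectly
# collinear here, so C_xx and C_yy would be singular without a small ridge term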
reg_param = 1e-5

# Compute covariance matrices
C_xx = np.dot(X_centered.T, X_centered) / (X.shape[0] - 1)
C_yy = np.dot(Y_centered.T, Y_centered) / (Y.shape[0] - 1)
C_xy = np.dot(X_centered.T, Y_centered) / (X.shape[0] - 1)

# Add regularization
C_xx += reg_param * np.eye(C_xx.shape[0])
C_yy += reg_param * np.eye(C_yy.shape[0])

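# Whiten each set with the inverse Cholesky factor: if C = L L^T,
# then inv(L) transforms that set to unit covariance
L_x_inv = np.linalg.inv(np.linalg.cholesky(C_xx))
L_y_inv = np.linalg.inv(np.linalg.cholesky(C_yy))

# Calculate the matrix for SVD (the whitened cross-covariance)
matrix_for_svd = L_x_inv @ C_xy @ L_y_inv.T

# Perform SVD
U, s, Vt = svd(matrix_for_svd)

# Canonical correlations are the singular values of the whitened cross-covariance
canonical_correlations = s

print("Canonical Correlations:", canonical_correlations)

The first canonical correlation printed is approximately 0.99999875. The second is zero up to floating-point rounding error, so its exact digits vary between machines.

This code snippet demonstrates how to perform Canonical Correlation Analysis using NumPy. It starts with two sets of variables, X and Y, and centers each by subtracting the column means. It then computes the covariance matrices, whitens each set with a Cholesky factor, and takes the singular value decomposition (SVD) of the whitened cross-covariance. The singular values are the canonical correlations: each one is the correlation between a pair of linear combinations, one from each set. Taking the singular values of inv(C_xx) C_xy inv(C_yy) C_xy.T instead would not give the correlations themselves, because the eigenvalues of that matrix are the squared canonical correlations. In this toy dataset every entry of Y equals 7 minus the matching entry of X, so the first canonical correlation is essentially 1 (slightly below because of the regularization term), and the second is zero because each set contains only one independent direction. CCA is used in statistics, machine learning, and data analysis to understand the relationships between two sets of variables.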
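The sample data above has only three observations and collinear columns, so it mainly exercises the regularization. The sketch below runs CCA on larger synthetic data using a different, QR-based formulation that avoids inverting covariance matrices. It is not the covariance-based method shown above. The function name canonical_correlations, the seed, and the data-generating recipe are all assumptions made for illustration, and the function assumes each set has full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations via QR decomposition (assumes full column rank)."""
    # Center each set of variables
    X_centered = X - X.mean(axis=0)
    Y_centered = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data
    Q_x, _ = np.linalg.qr(X_centered)
    Q_y, _ = np.linalg.qr(Y_centered)
    # Singular values of Q_x^T Q_y are the canonical correlations
    s = np.linalg.svd(Q_x.T @ Q_y, compute_uv=False)
    # Clip tiny rounding overshoots so every value lies in [0, 1]
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n = 500
shared = rng.normal(size=n)     # hypothetical signal present in both sets

# First column of each set carries the shared signal plus noise;
# second column is pure noise
X = np.column_stack((shared + 0.5 * rng.normal(size=n), rng.normal(size=n)))
Y = np.column_stack((shared + 0.5 * rng.normal(size=n), rng.normal(size=n)))

rho = canonical_correlations(X, Y)
print("Canonical Correlations:", rho)
```

With this recipe the first canonical correlation should land near 0.8 (the population correlation between the two noisy copies of the shared signal), and the second should be close to zero, differing from it only by sampling noise.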

Conclusion

This tutorial covered covariance matrices, PCA, multiple linear regression, and CCA, which is only a small part of what NumPy can do for multivariate analysis. Because NumPy's array operations run in optimized compiled code, these computations stay efficient as datasets grow.