Partial Least Squares (PLS) regression is a statistical method that is primarily used to model relationships between datasets by projecting the predictor and response variables into a new space. It's particularly useful for dealing with multicollinear and high-dimensional data environments, which are common in fields like genomics or chemometrics. In this article, we're going to explore how you can implement PLS regression using Scikit-Learn, a widely-used open-source library in Python for machine learning.
The key benefit of PLS is its ability to handle datasets where the predictors are highly collinear, which would be problematic for methods like ordinary least squares regression. PLS extracts components (also called "latent variables") from the predictors that are chosen to explain as much of the covariance with the response variables as possible.
Getting Started
First, ensure you have Scikit-Learn installed. You can install it using pip:
pip install scikit-learn

Next, let's dive into setting up a basic Partial Least Squares Regression model with Scikit-Learn. We'll start by importing the necessary libraries and then generating some synthetic data:
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
# Create synthetic data
np.random.seed(0)
X = np.random.normal(size=(100, 10))
# Create response variable with some noise
Y = np.random.normal(size=(100,)) + np.dot(X, np.random.normal(size=(10,)))
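One thing worth noting: the synthetic predictors above are drawn independently, so they are not actually collinear. If you want to mimic the collinear settings where PLS shines, you could instead build correlated columns; a hedged sketch (variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_base = rng.normal(size=(100, 5))
# Duplicate each column with a little noise so each pair is almost perfectly correlated
X_collinear = np.hstack([X_base, X_base + 0.01 * rng.normal(size=(100, 5))])

# The correlation between a column and its noisy copy is close to 1
print(np.corrcoef(X_collinear[:, 0], X_collinear[:, 5])[0, 1])
```

For simplicity, the running example keeps the independent predictors.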
With the data created, let's split it into a training set and a test set:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Building the PLS Model
Now it's time to create the PLS model. The number of components (i.e., latent variables) is the most important hyperparameter to tune. Here, we set it to 2 for demonstration, but in real-world scenarios you should choose it with cross-validation:
pls = PLSRegression(n_components=2)
pls.fit(X_train, Y_train)

After fitting the model, we can make predictions:
Y_pred = pls.predict(X_test)
# Calculate mean squared error
mse = mean_squared_error(Y_test, Y_pred)
print(f'Mean Squared Error: {mse:.4f}')

Note that PLS does more than predict: it projects the predictors (and, during fitting, the responses) onto a shared set of latent components, which is what makes it robust to collinearity. Let's visualize the model's fit:
plt.scatter(Y_test, Y_pred)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('PLS Regression Predictions')
plt.plot([min(Y_test), max(Y_test)], [min(Y_test), max(Y_test)], color='red')
plt.show()

Conclusion
PLS regression is a powerful method for high-dimensional datasets with collinear variables, and Scikit-Learn's interface makes it straightforward to apply. By selecting the number of components with cross-validation, practitioners can ensure they've chosen a value that generalizes well. Experiment with different numbers of components and examine the latent variables themselves for a more informed application of PLS regression. This is particularly helpful in fields such as chemometrics and biology, where many correlated measurements must be interpreted together.