Imputing Missing Values with Scikit-Learn's `SimpleImputer`

Handling missing data is a common challenge when working with real-world datasets. Missing values can arise from human error, system failures, or values that were simply never recorded, and they can distort data analysis or degrade machine learning model performance. In Python, Scikit-Learn provides a convenient way to deal with missing data through the SimpleImputer class.

What is SimpleImputer?
Installation
Example: Using SimpleImputer
Conclusion

What is `SimpleImputer`?

The SimpleImputer class is part of Scikit-Learn's `sklearn.impute` module and replaces missing values in a dataset using a specified strategy. The available strategies replace missing values with the mean, the median, or the most frequent value (mode) of the respective feature, or with a constant value that you supply.

Installation

Before diving into examples, let's ensure that you have Scikit-Learn installed in your environment. You can do this using pip:

pip install scikit-learn
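As a quick sanity check after installing, something like the following should run without errors. It is only a minimal sketch; the variable name `installed_version` is chosen here for illustration.

```python
import sklearn

# Importing the package and reading its version string confirms the install works
installed_version = sklearn.__version__
print(installed_version)
```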

Example: Using `SimpleImputer`

Let’s look at how we can use SimpleImputer to handle missing values in a dataset. We’ll simulate a dataset with missing values and apply different strategies using SimpleImputer.

1. Imputing with the Mean Value

In this example, we replace missing numerical data with the mean value of each feature.

from sklearn.impute import SimpleImputer
import numpy as np

# Simulate data: np.nan marks the missing entries in the first column
data = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 8]])

# Initialize the SimpleImputer with strategy='mean'
imputer = SimpleImputer(strategy='mean')

# Fit the imputer and transform the data
imputed_data = imputer.fit_transform(data)
print(imputed_data)

In the output, the two missing values in the first column are replaced with 4.0, the mean of the observed values 1 and 7. The second column has no missing values and is unchanged.
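In practice the fill values are usually learned from one dataset and then reused on another, for example fitting on training data and transforming new data. The sketch below illustrates that split with made-up arrays; the names `X_train` and `X_new` are assumptions for this example, not part of the original walkthrough.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data only: np.nan marks the missing entries
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_new = np.array([[np.nan, np.nan], [5.0, 40.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learn one mean per column from X_train only

# statistics_ holds the learned fill value for each column
learned = imputer.statistics_

# Missing entries in X_new are filled with the means learned from X_train
X_new_filled = imputer.transform(X_new)
print(learned)
print(X_new_filled)
```

Calling `fit` and `transform` separately keeps information from the new data out of the learned statistics.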

2. Imputing with the Median Value

To replace missing values with the median, set the `strategy` parameter to 'median'. The median is less sensitive to extreme values than the mean, so it is often a better choice for numeric features that are skewed or contain outliers.

# Initialize the SimpleImputer with strategy='median'
imputer = SimpleImputer(strategy='median')

# Fit the imputer and transform the data
imputed_data = imputer.fit_transform(data)
print(imputed_data)
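To see why outliers matter, the following sketch compares the two strategies on a single feature with one extreme value. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with an extreme value (1000) and one missing entry
X = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)

mean_fill = mean_filled[-1, 0]      # 251.5, pulled up by the outlier
median_fill = median_filled[-1, 0]  # 2.5, close to the typical values
print(mean_fill, median_fill)
```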

3. Imputing with the Most Frequent Value

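For categorical data, replacing missing values with the most frequent value (the mode) is a common strategy. It also works on numeric data; when several values tie for most frequent, SimpleImputer uses the smallest of them.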

# Initialize the SimpleImputer with strategy='most_frequent'
imputer = SimpleImputer(strategy='most_frequent')

# Fit the imputer and transform the data
imputed_data = imputer.fit_transform(data)
print(imputed_data)
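Since this strategy is mostly used for categorical features, here is a small sketch with string values. The `colors` array is hypothetical; note the `dtype=object`, which keeps `np.nan` as a real missing marker instead of converting it to the string "nan".

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single categorical column with one missing entry
colors = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(colors)

# The missing entry takes the most common category, "red"
print(filled)
```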

4. Imputing with a Constant Value

If you want to replace missing values with a specific value, set `strategy` to 'constant' and pass the replacement through the `fill_value` parameter.

# Initialize the SimpleImputer with strategy='constant' and fill_value=0
imputer = SimpleImputer(strategy='constant', fill_value=0)

# Fit the imputer and transform the data
imputed_data = imputer.fit_transform(data)
print(imputed_data)
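A constant fill hides which entries were originally missing. One option, sketched below, is the `add_indicator` parameter of SimpleImputer, which appends a binary column for each feature that had missing values during fitting. The variable name `with_flags` is an assumption made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 8]])

# add_indicator=True appends a 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
with_flags = imputer.fit_transform(data)

# Columns: the two imputed features, then the indicator for the first feature
print(with_flags)
```

Only the first feature contains missing values here, so a single indicator column is added.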

Conclusion

The SimpleImputer class in Scikit-Learn provides a straightforward API for filling missing data with various strategies. The appropriate strategy depends on the characteristics of the data and on domain knowledge: the mean or median for numeric features, the most frequent value for categorical ones, or a constant when a specific placeholder makes sense. Imputing missing values gives you a complete dataset that downstream machine learning models can work with.
