Imputing Missing Values with Scikit-Learn's `SimpleImputer`

Last updated: December 17, 2024

Handling missing data is a common challenge when working with real-world datasets. Missing values can arise from human error, system failures, or measurements that were never recorded, and they can distort an analysis or degrade machine learning model performance. In Python, Scikit-Learn provides a convenient way to deal with missing data through the SimpleImputer class.

What is SimpleImputer?

The SimpleImputer class is part of Scikit-Learn's `sklearn.impute` module and replaces missing values in a dataset using a specified strategy. Each feature (column) is handled independently: missing values can be replaced with that column's mean, median, or most frequent value (mode), or with a constant value that you supply.
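To make the four strategies concrete, here is a minimal sketch that applies each of them to the same single-column array. The array `X` and the names `imputers` and `filled` are made up for illustration; only `SimpleImputer`, its `strategy` and `fill_value` parameters, and `fit_transform` come from Scikit-Learn.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature column; the fourth entry is missing
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

# One imputer per strategy; fill_value is only needed for 'constant'
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "constant": SimpleImputer(strategy="constant", fill_value=0),
}

# Record the value each strategy writes into the missing slot (row 3)
filled = {
    name: float(imputer.fit_transform(X)[3, 0])
    for name, imputer in imputers.items()
}
print(filled)
# {'mean': 3.75, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```

The observed values are 1, 2, 2, and 10, so the mean is pulled up to 3.75 by the 10 while the median and the mode both stay at 2.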

Installation

Before diving into examples, let's ensure that you have Scikit-Learn installed in your environment. You can do this using pip:

pip install scikit-learn

Example: Using SimpleImputer

Let’s look at how we can use SimpleImputer to handle missing values in a dataset. We’ll simulate a dataset with missing values and apply different strategies using SimpleImputer.

1. Imputing with the Mean Value

In this example, we replace missing numerical data with the mean value of each feature.

from sklearn.impute import SimpleImputer
import numpy as np

# Simulate data
data = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 8]])

# Initialize the SimpleImputer with strategy='mean'
imputer = SimpleImputer(strategy='mean')

# Fit the imputer and transform the data
imputed_data = imputer.fit_transform(data)
print(imputed_data)

In the output, missing values are replaced with the mean of their respective columns.
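As a rough check of what the imputer learned, the sketch below inspects the fitted `statistics_` attribute and then reuses the same imputer on new rows. It assumes both missing entries in the example array are `np.nan`; the `new_rows` array is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same shape of data as the example above, with np.nan marking missing entries
data = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 8]])

imputer = SimpleImputer(strategy="mean")
imputed_data = imputer.fit_transform(data)

# statistics_ holds the per-column fill values computed during fit:
# column 0 -> mean of (1, 7) = 4.0, column 1 -> mean of (2, 3, 6, 8) = 4.75
print(imputer.statistics_)

# A fitted imputer fills new data with the statistics it already learned
new_rows = np.array([[np.nan, 5.0], [3.0, np.nan]])  # hypothetical unseen rows
filled_new = imputer.transform(new_rows)
print(filled_new)
```

Calling `transform` rather than `fit_transform` on the new rows matters here: the fill values stay at 4.0 and 4.75 instead of being recomputed from the new data.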

2. Imputing with the Median Value

To replace missing values with the median, set the `strategy` parameter to `'median'`. The median is often a better choice than the mean when a numeric feature is skewed or contains outliers, because a few extreme values do not pull it away from the typical value.

# Initialize the SimpleImputer with strategy='median'
imputer = SimpleImputer(strategy='median')

# Fit the imputer on the same `data` array as above and transform it
imputed_data = imputer.fit_transform(data)
print(imputed_data)
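The effect of outliers is easier to see with a column that contains one extreme value. The following sketch uses a made-up array, `values`, and compares the fill value each strategy produces; the variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: three small values, one outlier, one missing entry
values = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

mean_fill = SimpleImputer(strategy="mean").fit(values).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(values).statistics_[0]

print(mean_fill)    # 251.5, dominated by the outlier
print(median_fill)  # 2.5, close to the typical values
```

With the mean, the missing entry would be filled with 251.5, a value unlike any of the typical observations, while the median gives 2.5.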

3. Imputing with the Most Frequent Value

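For categorical data, replacing missing values with the most frequent value (the mode) is a common strategy. It also works on numeric data, as in the array used here; when several values tie for most frequent, SimpleImputer uses the smallest of them.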

# Initialize the SimpleImputer with strategy='most_frequent'
imputer = SimpleImputer(strategy='most_frequent')

# Fit the imputer on the same `data` array as above and transform it
imputed_data = imputer.fit_transform(data)
print(imputed_data)
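Since this strategy is mainly aimed at categorical features, a string-valued example may be more representative. The sketch below assumes the categories are stored in a NumPy object array with `np.nan` marking the missing entry; the `colors` array is invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column stored as an object array
colors = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
filled_colors = imputer.fit_transform(colors)

# "red" appears twice and "blue" once, so the missing entry becomes "red"
print(filled_colors)
```

The mean and median strategies only apply to numeric data, so for string columns like this one the choice is between the most frequent value and a constant.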

4. Imputing with a Constant Value

If you want to replace missing values with a specific value, set `strategy` to `'constant'` and pass the replacement through the `fill_value` parameter.

# Initialize the SimpleImputer with strategy='constant' and fill_value=0
imputer = SimpleImputer(strategy='constant', fill_value=0)

# Fit the imputer on the same `data` array as above and transform it
imputed_data = imputer.fit_transform(data)
print(imputed_data)
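The constant strategy is not limited to numbers. As a small sketch, the block below fills a numeric array with 0 and a string column with a placeholder label; the `labels` array and the `"unknown"` placeholder are assumptions chosen for illustration, and the numeric array assumes both missing entries are `np.nan`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric case: every missing entry becomes 0
data = np.array([[1, 2], [np.nan, 3], [7, 6], [np.nan, 8]])
zero_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(data)
print(zero_filled)

# Categorical case: a placeholder string keeps "missing" visible as its own category
labels = np.array([["red"], [np.nan], ["blue"]], dtype=object)  # hypothetical column
label_filled = SimpleImputer(
    strategy="constant", fill_value="unknown"
).fit_transform(labels)
print(label_filled)
```

Unlike the other strategies, a constant does not depend on the observed values, so it leaves an explicit marker of which entries were imputed as long as the chosen value does not otherwise occur in the column.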

Conclusion

The SimpleImputer class in Scikit-Learn provides a straightforward API for filling missing data with various strategies. The appropriate strategy depends on the characteristics of the data and on domain knowledge: the mean or median suits numeric features, the most frequent value suits categorical ones, and a constant keeps the imputation explicit. Handling missing values up front gives you a complete dataset that can be passed to estimators that do not accept NaN inputs.
