Understanding Pandas DataFrame.interpolate() method (5 examples)

Updated: February 22, 2024 By: Guest Contributor

Introduction

Dealing with missing values is a common pre-processing task in data science and analytics. There are multiple strategies for handling such scenarios – deletion, mean substitution, and interpolation, to name a few. The Pandas Python library, a cornerstone in the data scientist’s toolkit, offers robust capabilities for handling missing data. Among these, the .interpolate() method provides a powerful and versatile way of filling in these gaps based on various interpolation techniques.

This guide walks you through the basics of the Pandas .interpolate() method, gradually advancing to more complex examples. By the end, you’ll have a comprehensive understanding of how this function can be applied to real-world data scenarios.

Basic Linear Interpolation

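Linear interpolation is the default strategy of the .interpolate() method. It estimates each missing value by drawing a straight line between the nearest valid values on either side and reading off the value at the missing position. Note that method='linear' ignores the index and treats the rows as equally spaced.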

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})
print('Original DataFrame:\n', df)

df_interpolated = df.interpolate()
print('DataFrame after Linear Interpolation:\n', df_interpolated)

Output:

Original DataFrame:
     A    B
 0  1.0  4.0
 1  NaN  5.0
 2  3.0  NaN

DataFrame after Linear Interpolation:
     A    B
 0  1.0  4.0
 1  2.0  5.0
 2  3.0  5.0
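The 5.0 in the last row of column B is worth a closer look: there is no later value to draw a line to, so the last valid value is simply carried forward. The sketch below illustrates how the default settings treat leading, interior, and trailing gaps. The Series `s` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one leading gap, one interior gap, one trailing gap.
s = pd.Series([np.nan, 4.0, np.nan, 8.0, np.nan])

# Defaults: method='linear', filling in the forward direction.
filled = s.interpolate()

# Interior gap: midpoint of its neighbours, (4.0 + 8.0) / 2 = 6.0
# Trailing gap: repeats the last valid value, 8.0
# Leading gap: stays NaN, because there is no earlier value to start from
print(filled.tolist())  # [nan, 4.0, 6.0, 8.0, 8.0]
```

If the trailing fill is unwanted, passing limit_area='inside' restricts filling to gaps that have valid values on both sides.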

Time Series Interpolation

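For time series data, plain linear interpolation may not be appropriate when observations are unevenly spaced, because it ignores how much time separates them. With a DatetimeIndex, method='time' weights each estimate by the actual time elapsed between observations. In the example below the timestamps are evenly spaced (one per day), so the result is the same as the default linear fill would give.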

# Six consecutive daily timestamps to use as the index
date_index = pd.date_range('20210101', periods=6)
df = pd.DataFrame({'Temperature': [22, np.nan, np.nan, 25, np.nan, 28]}, index=date_index)
print('Original Time Series:\n', df)

df_interpolated = df.interpolate(method='time')
print('After Time Series Interpolation:\n', df_interpolated)

Output:

Original Time Series:
             Temperature
 2021-01-01         22.0
 2021-01-02          NaN
 2021-01-03          NaN
 2021-01-04         25.0
 2021-01-05          NaN
 2021-01-06         28.0

After Time Series Interpolation:
             Temperature
 2021-01-01         22.0
 2021-01-02         23.0
 2021-01-03         24.0
 2021-01-04         25.0
 2021-01-05         26.5
 2021-01-06         28.0
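The daily timestamps in the example above are evenly spaced, so method='time' and the default linear method agree there. The difference only shows up with irregular spacing. Here is a minimal sketch using three made-up readings (the names `idx`, `by_position`, and `by_time` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one day separates the first two timestamps,
# three days separate the last two.
idx = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-05'])
s = pd.Series([10.0, np.nan, 50.0], index=idx)

# 'linear' ignores the index: the gap is treated as halfway, giving 30.0.
by_position = s.interpolate(method='linear')

# 'time' uses the timestamps: Jan 2 is one quarter of the way from
# Jan 1 to Jan 5, giving a value of about 20.0.
by_time = s.interpolate(method='time')

print(by_position.iloc[1], by_time.iloc[1])
```

When the index carries real timing information, the time-weighted estimate is usually the one that reflects the data.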

Polynomial Interpolation

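For datasets that show a non-linear pattern, polynomial interpolation may be a better fit. This technique estimates the missing values from a curve of the given order passed through the known points. With method='polynomial' the order argument is required, and SciPy must be installed because Pandas delegates the calculation to it. The sample data below happens to lie on a straight line, so the filled values coincide with what linear interpolation would give.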

df = pd.DataFrame({'A': [1, np.nan, 3, 4, np.nan, 6]})
print('Original DataFrame:\n', df)

df_interpolated = df.interpolate(method='polynomial', order=2)
print('DataFrame after Polynomial Interpolation:\n', df_interpolated)

Output:

Original DataFrame:
      A
 0  1.0
 1  NaN
 2  3.0
 3  4.0
 4  NaN
 5  6.0

DataFrame after Polynomial Interpolation:
      A
 0  1.0
 1  2.0
 2  3.0
 3  4.0
 4  5.0
 5  6.0
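Because the sample data above is a straight line, it does not show why a curve can do better. The sketch below uses made-up values lying on y = x**2 and compares the linear fill with a degree-2 fit. To stay free of the SciPy dependency, the fit is done by hand with numpy.polyfit; this is a stand-in for the idea, not what Pandas runs internally, and method='polynomial' may differ in detail.

```python
import numpy as np
import pandas as pd

# Hypothetical data lying on y = x**2, with the value at x = 2 missing.
s = pd.Series([0.0, 1.0, np.nan, 9.0, 16.0])

# Linear fill: midpoint of the neighbours 1.0 and 9.0, i.e. 5.0.
linear_guess = s.interpolate().iloc[2]

# Fit a degree-2 polynomial to the known points, then evaluate it at the gap.
known = s.dropna()
coeffs = np.polyfit(known.index.to_numpy(dtype=float), known.to_numpy(), deg=2)
quadratic_guess = np.polyval(coeffs, 2.0)  # close to the true value 4.0

print(linear_guess, round(float(quadratic_guess), 6))
```

The linear estimate overshoots the true value of 4.0, while the quadratic fit recovers it up to floating-point error.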

Spline Interpolation

Spline interpolation fills in missing values using piecewise polynomials (splines) that are joined smoothly. It can be especially useful for data with a natural curvature, such as smooth time series. Like method='polynomial', method='spline' requires the order argument and depends on SciPy.

df = pd.DataFrame({'A': [1, np.nan, 3, 4, 5, np.nan, 7]})
print('Original DataFrame:\n', df)

df_interpolated = df.interpolate(method='spline', order=3)
print('DataFrame after Spline Interpolation:\n', df_interpolated)

Output:

Original DataFrame:
      A
 0  1.0
 1  NaN
 2  3.0
 3  4.0
 4  5.0
 5  NaN
 6  7.0

DataFrame after Spline Interpolation:
      A
 0  1.0
 1  2.0
 2  3.0
 3  4.0
 4  5.0
 5  6.0
 6  7.0

Advanced Options: Limit and Direction

Interpolation can also be fine-tuned with two parameters: limit sets the maximum number of consecutive NaN values to fill, and limit_direction sets the direction in which gaps are filled ('forward', 'backward', or 'both'). This helps when only short gaps should be filled, or when long runs of missing data should stay visibly missing.

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan, 5, np.nan]})
print('Original DataFrame:\n', df)

df_interpolated = df.interpolate(limit=2, limit_direction='forward')
print('DataFrame after Limiting and Direction:\n', df_interpolated)

Output:

Original DataFrame:
      A
 0  1.0
 1  NaN
 2  NaN
 3  NaN
 4  5.0
 5  NaN

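DataFrame after Limiting and Direction:
      A
 0  1.0
 1  2.0
 2  3.0
 3  NaN
 4  5.0
 5  5.0

Index 3 stays NaN because the limit of two consecutive fills has been reached. Index 5 is filled with 5.0 because forward linear interpolation carries the last valid value into a trailing gap.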
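The other directions can be tried on the same DataFrame. The following sketch shows 'backward' and 'both', plus the related limit_area parameter, which is not covered above but belongs to the same method. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan, 5, np.nan]})

# Fill at most 2 NaNs per gap, working backward from the next valid value.
backward = df.interpolate(limit=2, limit_direction='backward')
# A: [1.0, NaN, 3.0, 4.0, 5.0, NaN]

# Fill at most 1 NaN from each side of a gap.
both = df.interpolate(limit=1, limit_direction='both')
# A: [1.0, 2.0, NaN, 4.0, 5.0, 5.0]

# Fill only gaps enclosed by valid values; the trailing NaN is left alone.
inside = df.interpolate(limit_area='inside')
# A: [1.0, 2.0, 3.0, 4.0, 5.0, NaN]

print(backward['A'].tolist())
print(both['A'].tolist())
print(inside['A'].tolist())
```

Using limit_area='inside' is one way to avoid extending the last observation past the end of the data.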

Conclusion

The interpolate() method in Pandas is a versatile tool for handling missing values across a wide range of contexts, from a simple linear fill to time-weighted estimates and curve fitting with polynomial and spline methods. As with any data pre-processing technique, the choice of method should be guided by the nature of your dataset and the analytical goals at hand.