Exploring pandas.Series.combine_first() method (with examples)

Updated: February 17, 2024 By: Guest Contributor

Overview

Pandas is a core tool in the data science ecosystem for data manipulation and analysis. When you are dealing with missing data, methods like combine_first() come in handy. This tutorial walks through the combine_first() method of the pandas Series object using practical examples.

Introduction to combine_first()

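The combine_first() method combines two Series objects. Values from the calling Series take priority; wherever the caller has a null value, or has no entry at all for an index label, the value from the other Series is used instead. The result is indexed by the union of both indexes. It’s particularly useful in the data cleaning and preparation phases of a data analysis workflow.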

Let’s start with essential imports:

import pandas as pd
import numpy as np

Basic Example

Consider two Series objects, s1 and s2, where s1 has some missing values:

s1 = pd.Series([1, np.nan, 3, np.nan, 5])
s2 = pd.Series([5, 4, 3, 2, 1])
print(s1.combine_first(s2))

Output:

0    1.0
1    4.0
2    3.0
3    2.0
4    5.0
dtype: float64

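The missing values at positions 1 and 3 of s1 were filled from s2. The non-null values of s1 were kept, even where s2 holds a different value (position 0, for example).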
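Because the calling Series has priority, the order of the two operands matters. The short sketch below reuses the same data to illustrate this; the variable names first and second are only for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan, 3, np.nan, 5])
s2 = pd.Series([5, 4, 3, 2, 1])

# s1 is the caller: its non-null values are kept, gaps come from s2.
first = s1.combine_first(s2)

# s2 is the caller: it has no gaps, so no value is taken from s1.
second = s2.combine_first(s1)

print(first.tolist())   # [1.0, 4.0, 3.0, 2.0, 5.0]
print(second.tolist())  # same values as s2
```

In other words, s1.combine_first(s2) and s2.combine_first(s1) are generally not interchangeable: call the method on the Series whose values you trust most.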

Handling Non-Numeric Data

combine_first() is not limited to numeric data; it works with text data too:

s1 = pd.Series(['apple', np.nan, 'carrot', np.nan])
s2 = pd.Series([np.nan, 'banana', np.nan, 'date'])
print(s1.combine_first(s2))

Output:

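0     apple
1    banana
2    carrot
3      date
dtype: object

In this case, s2 fills in the missing text values in s1. With pandas 2.x, a Series built from plain Python strings reports its dtype as object unless a string dtype is requested explicitly.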
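Missing text entries are not always np.nan. As a small sketch with made-up values, a Python None in the calling Series is also treated as missing and filled from the other Series:

```python
import pandas as pd

# None marks the missing entry here instead of np.nan.
s1 = pd.Series(['apple', None, 'carrot'])
s2 = pd.Series(['avocado', 'banana', 'cherry'])

result = s1.combine_first(s2)
print(result.tolist())  # ['apple', 'banana', 'carrot']
```

Only the None slot is replaced; 'avocado' and 'cherry' are ignored because s1 already has values at those positions.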

Index Alignment

A key feature of combine_first() is its ability to align Series by their indexes, making it incredibly useful for combining data that may not perfectly overlap:

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6, 7], index=['b', 'c', 'd', 'e'])
print(s1.combine_first(s2))

Output:

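a    1
b    2
c    3
d    6
e    7
dtype: int64

The result is indexed by the union of both indexes. Labels a, b, and c keep the values from s1 (for b and c, the values 4 and 5 in s2 are ignored because s1 already has data there), while d and e, which exist only in s2, are taken from s2. Note that the printed dtype depends on the pandas version: older releases upcast this result to float64 and print 1.0, 2.0, and so on.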
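Index alignment is also what sets combine_first() apart from fillna(). The following sketch, using made-up values, compares the two on partially overlapping indexes:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([20.0, 30.0, 40.0], index=['b', 'c', 'd'])

# fillna() fills the gap at 'b' but keeps only the labels of s1.
filled = s1.fillna(s2)

# combine_first() fills the gap and also adds the extra label 'd' from s2.
combined = s1.combine_first(s2)

print(list(filled.index))    # ['a', 'b', 'c']
print(list(combined.index))  # ['a', 'b', 'c', 'd']
```

If the result must keep exactly the labels of the original Series, fillna() is the closer fit; if labels found only in the second Series should be carried over as well, combine_first() does that in one step.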

Working with DataFrames

Though this tutorial focuses on Series, it’s noteworthy that combine_first() can also be applied to DataFrames, addressing missing data across both rows and columns:

df1 = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 2, 3]})
df2 = pd.DataFrame({'A': [0, 4, np.nan], 'B': [1, np.nan, 5]})
print(df1.combine_first(df2))

Output:

     A    B
0  1.0  1.0
1  4.0  2.0
2  3.0  3.0

Here df2 fills the gaps in df1: A at row 1 and B at row 0. The value 5 in df2 (column B, row 2) is not used, because df1 already holds 3 at that position. The same rules therefore apply across pandas objects.
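For DataFrames, alignment applies to columns as well as rows. As a minimal sketch with invented data, two frames that share only column B combine into a frame with the union of their columns:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, np.nan], 'B': [np.nan, 2.0]})
df2 = pd.DataFrame({'B': [10.0, 20.0], 'C': [30.0, 40.0]})

result = df1.combine_first(df2)

print(list(result.columns))  # ['A', 'B', 'C']
print(result['B'].tolist())  # [10.0, 2.0]
```

Column A still contains a NaN in row 1, because df2 has no column A to fill it from; column C is copied over from df2 entirely.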

Combining with Conditions

combine_first() has no parameter for conditions, but you can filter the second Series before combining. For instance, you might only want to fill missing values from entries that meet a certain condition:

s1 = pd.Series([1, 2, np.nan, 4])
s2 = pd.Series([10, 20, 30, 40])
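# Boolean mask: True for the values that are allowed to fill gaps
def condition(values):
    return values < 30

# s2[condition(s2)] keeps only the entries 10 and 20 (index labels 0 and 1)
s1_combined = s1.combine_first(s2[condition(s2)])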
print(s1_combined)

Output:

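0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

The gap at index 2 stays NaN: the candidate value in s2 at that position is 30, which does not satisfy the condition s2 < 30, so it is filtered out before combining. This example demonstrates filtering s2 with a custom condition first, which gives you control over which values are allowed to fill the gaps.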
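A possible variant of this pattern uses Series.where() instead of boolean indexing, so the filtered Series keeps its full index and rejected values simply become NaN. The threshold `>= 30` below is a different, purely illustrative condition, chosen so that the gap does get filled:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, np.nan, 4])
s2 = pd.Series([10, 20, 30, 40])

# where() keeps values that satisfy the condition and sets the rest to NaN.
candidates = s2.where(s2 >= 30)

result = s1.combine_first(candidates)
print(result.tolist())  # [1.0, 2.0, 30.0, 4.0]
```

Index 2 is filled with 30 because that value passes the condition. Index 3 keeps 4 from s1 even though 40 also passes, since the calling Series always takes priority.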

Conclusion

The combine_first() method gives you a compact way to fill missing data and merge two Series while giving priority to one of them. It handles numeric and text data, aligns on the index so that partially overlapping data can be combined, and can be paired with filtering when only some values should be used as fill-ins, which makes it a useful part of a data preprocessing toolkit.