Pandas: How to compare 2 Series and show the difference

Updated: February 22, 2024 By: Guest Contributor

Introduction

In this tutorial, we will compare two Pandas Series and display their differences using functions and methods available in the Pandas library. Whether you’re working with large datasets or just need a quick check, knowing how to compare two Series is a core data analysis skill. We will start with simple examples and gradually move to more complex scenarios.

Preparation

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. Comparing two Series means looking for differences or similarities between them, in their values, their index labels, or both.

Setup for Examples

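Before diving into the examples, make sure you have Pandas installed in your environment. The leading ! runs the command from a Jupyter notebook cell; drop it when installing from a terminal: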

!pip install pandas

Once installed, we can import Pandas and create our first example Series:

import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([1, 2, 3, 4, 6])

Basic Comparison

For our first example, we’ll simply check if the two series are equal:

print(series1.equals(series2))

This prints False because the two Series differ in their last element (5 versus 6). To find the positions at which they differ, we can compare them element-wise:

diff_indices = series1 != series2
print(diff_indices)

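The result is a boolean Series that is True at every position where the values differ (here, only at index 4). Note that element-wise comparison with != requires both Series to have identical index labels; otherwise Pandas raises a ValueError.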
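To show only the differing positions together with both values, one option is Series.compare, available in Pandas 1.1 and later. The following is a minimal sketch using the same example data; the variable names changed, mask, and diff_labels are illustrative and not part of the original example.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([1, 2, 3, 4, 6])

# Series.compare (pandas 1.1+) keeps only the rows where the values differ
# and labels the two sides "self" and "other".
changed = series1.compare(series2)
print(changed)

# The boolean mask can also be reduced to a list of the differing index labels.
mask = series1 != series2
diff_labels = mask[mask].index.tolist()
print(diff_labels)
```

Like the != operator, compare expects both Series to carry the same index labels.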

Using the subtract Method

Another way to find differences is by using the subtract method:

difference = series1.subtract(series2)
print(difference)

This subtracts series2 from series1 element by element, aligning on the index. A result of 0 means the values match; any non-zero value (here, -1 at index 4) marks a difference and shows its size.
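The subtraction result can be filtered so that only the differing positions remain. The sketch below also shows what happens when the index labels do not fully overlap; series3 is a hypothetical Series introduced only for this illustration.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([1, 2, 3, 4, 6])

difference = series1.subtract(series2)

# Keep only the positions where the values are not identical.
nonzero = difference[difference != 0]
print(nonzero)

# Hypothetical case: series3 has an extra label (5) that series1 lacks.
series3 = pd.Series([1, 2, 3, 4, 6, 7])

# Without fill_value, the unmatched label produces NaN.
unfilled = series1.subtract(series3)
print(unfilled)

# With fill_value=0, the missing side is treated as 0 before subtracting.
filled = series1.subtract(series3, fill_value=0)
print(filled)
```

Whether 0 is a sensible fill value depends on the data; leaving the NaN in place may be the clearer signal that a label exists on only one side.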

Advanced Comparison Techniques

For more complex scenarios, such as non-numeric data (where subtraction is not available) or when we need more detail on the differences, we can combine element-wise comparison with boolean indexing.

series1 = pd.Series(['a', 'b', 'c', 'd', 'e'])
series2 = pd.Series(['a', 'b', 'x', 'y', 'e'])

# Identifying differences
diff = series1[series1 != series2]
print('Differences in series1 relative to series2:\n', diff)

This prints the values in series1 at the positions where the two Series differ (here, 'c' and 'd' at index 2 and 3). Applying the same mask to series2 shows the corresponding values on the other side.
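As a rough sketch of that idea, the same boolean mask can select from both Series, and the two selections can be placed next to each other in a small DataFrame. The column names 'series1' and 'series2' are chosen here only for readability.

```python
import pandas as pd

series1 = pd.Series(['a', 'b', 'c', 'd', 'e'])
series2 = pd.Series(['a', 'b', 'x', 'y', 'e'])

# True wherever the two Series disagree.
mask = series1 != series2

# Values from series2 at the positions where the two Series disagree.
diff_in_series2 = series2[mask]
print(diff_in_series2)

# Put both sides next to each other; the column names are illustrative.
side_by_side = pd.DataFrame({'series1': series1[mask], 'series2': series2[mask]})
print(side_by_side)
```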

Using the merge Function for Comparison

We can also use the merge function to compare Series by value rather than by position. This is especially useful when the Series have different lengths or when we want to know which values they share:

merged = pd.merge(series1.reset_index(), series2.reset_index(), on=0, how='outer', indicator=True)
print(merged)

This approach converts each Series into a small DataFrame with reset_index(), then merges the two on their values (column 0, the default name for the values of an unnamed Series). The indicator parameter adds a _merge column to the output, showing whether each value appears in both Series, only in the left one, or only in the right one.
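To narrow the merged output down to the actual differences, the _merge column can be used as a filter. The following sketch assumes two hypothetical Series of different lengths and gives them an explicit name ('value') so that the merge key is easier to read than the default 0; to_frame() is used instead of reset_index() because the original index is not needed here.

```python
import pandas as pd

# Hypothetical Series of different lengths, named so the merge key is explicit.
left = pd.Series(['a', 'b', 'c', 'd'], name='value')
right = pd.Series(['b', 'c', 'x'], name='value')

merged = pd.merge(left.to_frame(), right.to_frame(), on='value', how='outer', indicator=True)

# Keep only the values that appear in exactly one of the two Series.
only_one_side = merged[merged['_merge'] != 'both']
print(only_one_side)
```

Because this compares by value and ignores position, it answers a different question than the element-wise checks above. Repeated values in either Series produce one merged row per matching pair, so de-duplicating first may be worthwhile.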

Conclusion

Comparing two Pandas Series can range from straightforward methods like simple equality checks and subtraction to more nuanced approaches involving boolean indexing and DataFrame merges. Through this tutorial, you should now have a solid foundation for identifying differences across Series and applying these techniques to your data analysis tasks.