Pandas DataFrame: How to compare 2 columns (row-wise)

Updated: February 21, 2024 By: Guest Contributor Post a comment

Introduction

Comparing two columns in a Pandas DataFrame is a common operation that you might need to perform for various data analysis tasks. Whether you’re looking to identify mismatches, find matches, or simply evaluate the differences between columns, Pandas provides versatile tools to accomplish these tasks effectively. This tutorial will guide you through several methods, from basic to advanced, including real-life code examples and their outputs.

Basic Comparison with Equality Operator

The simplest way to compare two columns in a DataFrame is by using the equality operator (==). This method allows you to check if the values in one column are equal to those in another column on a row-wise basis.

import pandas as pd
data = {'A': [1, 2, 3, 4], 'B': [1, 2, 4, 4]}
df = pd.DataFrame(data)
df['A_equals_B'] = df['A'] == df['B']
print(df)

Output:

   A  B  A_equals_B
0  1  1        True
1  2  2        True
2  3  4       False
3  4  4        True

This basic comparison gives you a new column indicating whether the values in columns A and B match on a row-wise basis.

Using the numpy.where Method

For more flexibility in comparison, such as custom output for matches and mismatches, numpy.where is an excellent choice. It allows you to specify the output for both conditions.

import numpy as np
import pandas as pd
data = {'A': [1, 2, 3, 4], 'B': [1, 2, 0, 4]}
df = pd.DataFrame(data)
df['Comparison'] = np.where(df['A'] == df['B'], 'Match', 'Mismatch')
print(df)

Output:

   A  B Comparison
0  1  1      Match
1  2  2      Match
2  3  0   Mismatch
3  4  4      Match

With numpy.where, you easily customize the output for visual distinction between matches and mismatches.

Column-wise Comparison Using apply with a Custom Function

If you require a more complex comparison, for instance, allowing a tolerance level in numeric differences, apply combined with a custom lambda function could be the answer. This approach gives you the flexibility to define any logic for comparison.

import pandas as pd
data = {'A': [100, 200, 300, 400], 'B': [100, 210, 295, 405]}
df = pd.DataFrame(data)
comparison_fn = lambda row: abs(row['A'] - row['B']) \<= 10
df['Close_Enough'] = df.apply(comparison_fn, axis=1)
print(df)

Output:

     A    B  Close_Enough
0  100  100          True
1  200  210          True
2  300  295          True
3  400  405          True

This method showcases how to apply a custom comparison that checks if the values in two columns are within a specified tolerance level.

Advanced: Using pandas.merge for Complex Comparisons

For comparing two columns from different DataFrames or conducting more sophisticated comparisons, pandas.merge can serve your needs. This method allows you to join two DataFrames based on the matching of column values.

For this example, let’s assume you have two DataFrames and you want to find common rows based on columns A and B.

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [2, 3, 4], 'B': [5, 6, 7]})
common_rows = pd.merge(df1, df2, on=['A', 'B'])
print(common_rows)

Output:

   A  B
0  2  5
1  3  6

This shows how pandas.merge can be used to find rows that match across two DataFrames based on specified columns.

Conclusion

From simple equality checks to complex comparisons involving custom logic or different DataFrames, Pandas provides a robust set of tools for comparing columns row-wise. By understanding and utilizing these methods, you can efficiently conduct various data comparison tasks to aid your data analysis process.