Pandas: How to ‘INNER JOIN’ 2 DataFrames

Introduction
Getting Started with INNER JOIN
Understanding the Parameters
Joining on Multiple Columns
Advanced Joining Techniques
Handling Duplicate Keys
Conclusion

Introduction

Data manipulation and analysis are critical steps in the data science workflow. Pandas, a powerful library in Python, simplifies these tasks with its comprehensive set of functions. One of the most common operations in data analysis is joining two datasets. This tutorial will focus on the ‘INNER JOIN’ operation using Pandas, guiding you from the basics to more advanced examples.

Getting Started with INNER JOIN

The ‘INNER JOIN’ operation in Pandas resembles the same operation in SQL. It merges two DataFrames based on common columns, returning only the rows with matching values in those columns. Let’s start with a basic example to understand how to perform an ‘INNER JOIN’.

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'values1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E'],
                    'values2': [5, 6, 7]})

# Performing an INNER JOIN
df_inner_joined = pd.merge(df1, df2, on='key')
print(df_inner_joined)

Output:

  key  values1  values2
0   B        2        5
1   D        3        6

Understanding the Parameters

The pd.merge() function is central to performing inner joins. The most crucial parameter is on, which specifies the column(s) to join on. Additional parameters like how, left_on, right_on, and suffixes allow for greater flexibility in joining.

Joining on Multiple Columns

It’s common to join DataFrames based on more than one column. The following example demonstrates how to achieve this:

df1 = pd.DataFrame({'key1': ['A', 'B', 'C', 'D'],
                    'key2': [1, 2, 3, 4],
                    'values1': [10, 20, 30, 40]})
df2 = pd.DataFrame({'key1': ['B', 'D', 'E'],
                    'key2': [2, 4, 5],
                    'values2': [50, 60, 70]})

# Joining on multiple columns
df_inner_joined = pd.merge(df1, df2, on=['key1', 'key2'])
print(df_inner_joined)

Output:

  key1  key2  values1  values2
0    B     2       20       50
1    D     4       40       60

Advanced Joining Techniques

For more complex scenarios, like when joining columns have different names or when needing to perform operations before joining, Pandas offers solutions.

When the DataFrames have different column names for the joining keys, use the left_on and right_on parameters:

df1 = pd.DataFrame({'key_left': ['A', 'B', 'C', 'D'],
                    'values': [100, 200, 300, 400]})
df2 = pd.DataFrame({'key_right': ['B', 'D', 'E', 'A'],
                     'values2': [500, 600, 700, 800]})

pd.merge(df1, df2, left_on='key_left', right_on='key_right')

An operation frequently performed before joining is filtering DataFrames. Here’s how to filter before joining:

# Given DataFrames df1 and df2
# Filter df1 where values1 > 20
filtered_df1 = df1[df1['values1'] > 20].copy()

# Now perform an INNER JOIN
pd.merge(filtered_df1, df2, on='key')

Handling Duplicate Keys

When dealing with duplicate keys, it’s important to decide how these duplicates should influence the join. With Pandas, you can specify how duplicates are handled using the suffixes parameter, which assigns different suffixes to overlapping column names.

Conclusion

Performing an ‘INNER JOIN’ in Pandas is a versatile and powerful way to merge datasets based on common columns. Starting with simple joins and moving towards more complex scenarios, we’ve seen how Pandas accommodates various data manipulation needs. The flexibility and functionality of Pandas make it an indispensable tool in the toolkit of any data analyst or scientist.

Next Article: Pandas: How to ‘LEFT JOIN’ 2 DataFrames

Previous Article: Pandas: How to create a DataFrame from a single string (basic & advanced)

Series: DateFrames in Pandas

Pandas

How to Use Pandas for Geospatial Data Analysis (3 examples)

February 28, 2024