Sling Academy
Home/Pandas/Pandas: How to ‘INNER JOIN’ 2 DataFrames

Pandas: How to ‘INNER JOIN’ 2 DataFrames

Last updated: February 21, 2024

Introduction

Data manipulation and analysis are critical steps in the data science workflow. Pandas, a powerful library in Python, simplifies these tasks with its comprehensive set of functions. One of the most common operations in data analysis is joining two datasets. This tutorial will focus on the ‘INNER JOIN’ operation using Pandas, guiding you from the basics to more advanced examples.

Getting Started with INNER JOIN

The ‘INNER JOIN’ operation in Pandas resembles the same operation in SQL. It merges two DataFrames based on common columns, returning only the rows with matching values in those columns. Let’s start with a basic example to understand how to perform an ‘INNER JOIN’.

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'values1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E'],
                    'values2': [5, 6, 7]})

# Performing an INNER JOIN
df_inner_joined = pd.merge(df1, df2, on='key')
print(df_inner_joined)

Output:

  key  values1  values2
0   B        2        5
1   D        3        6

Understanding the Parameters

The pd.merge() function is central to performing inner joins. The most crucial parameter is on, which specifies the column(s) to join on. Additional parameters like how, left_on, right_on, and suffixes allow for greater flexibility in joining.

Joining on Multiple Columns

It’s common to join DataFrames based on more than one column. The following example demonstrates how to achieve this:

df1 = pd.DataFrame({'key1': ['A', 'B', 'C', 'D'],
                    'key2': [1, 2, 3, 4],
                    'values1': [10, 20, 30, 40]})
df2 = pd.DataFrame({'key1': ['B', 'D', 'E'],
                    'key2': [2, 4, 5],
                    'values2': [50, 60, 70]})

# Joining on multiple columns
df_inner_joined = pd.merge(df1, df2, on=['key1', 'key2'])
print(df_inner_joined)

Output:

  key1  key2  values1  values2
0    B     2       20       50
1    D     4       40       60

Advanced Joining Techniques

For more complex scenarios, like when joining columns have different names or when needing to perform operations before joining, Pandas offers solutions.

When the DataFrames have different column names for the joining keys, use the left_on and right_on parameters:

df1 = pd.DataFrame({'key_left': ['A', 'B', 'C', 'D'],
                    'values': [100, 200, 300, 400]})
df2 = pd.DataFrame({'key_right': ['B', 'D', 'E', 'A'],
                     'values2': [500, 600, 700, 800]})

pd.merge(df1, df2, left_on='key_left', right_on='key_right')

An operation frequently performed before joining is filtering DataFrames. Here’s how to filter before joining:

# Given DataFrames df1 and df2
# Filter df1 where values1 > 20
filtered_df1 = df1[df1['values1'] > 20].copy()

# Now perform an INNER JOIN
pd.merge(filtered_df1, df2, on='key')

Handling Duplicate Keys

When dealing with duplicate keys, it’s important to decide how these duplicates should influence the join. With Pandas, you can specify how duplicates are handled using the suffixes parameter, which assigns different suffixes to overlapping column names.

Conclusion

Performing an ‘INNER JOIN’ in Pandas is a versatile and powerful way to merge datasets based on common columns. Starting with simple joins and moving towards more complex scenarios, we’ve seen how Pandas accommodates various data manipulation needs. The flexibility and functionality of Pandas make it an indispensable tool in the toolkit of any data analyst or scientist.

Next Article: Pandas: How to ‘LEFT JOIN’ 2 DataFrames

Previous Article: Pandas: How to create a DataFrame from a single string (basic & advanced)

Series: DateFrames in Pandas

Pandas

You May Also Like

  • How to Use Pandas Profiling for Data Analysis (4 examples)
  • How to Handle Large Datasets with Pandas and Dask (4 examples)
  • Pandas – Using DataFrame.pivot() method (3 examples)
  • Pandas: How to ‘FULL JOIN’ 2 DataFrames (3 examples)
  • Pandas: Select columns whose names start/end with a specific string (4 examples)
  • 3 ways to turn off future warnings in Pandas
  • How to Integrate Pandas with Apache Spark
  • How to Use Pandas for Web Scraping and Saving Data (2 examples)
  • How to Clean and Preprocess Text Data with Pandas (3 examples)
  • Pandas – Using Series.replace() method (3 examples)
  • Pandas json_normalize() function: Explained with examples
  • Pandas: Reading CSV and Excel files from AWS S3 (4 examples)
  • Using pandas.Series.rank() method (4 examples)
  • Pandas: Dropping columns whose names contain a specific string (4 examples)
  • Pandas: How to print a DataFrame without index (3 ways)
  • Fixing Pandas NameError: name ‘df’ is not defined
  • Pandas – Using DataFrame idxmax() and idxmin() methods (4 examples)
  • Pandas FutureWarning: ‘M’ is deprecated and will be removed in a future version, please use ‘ME’ instead
  • Pandas: Checking equality of 2 DataFrames (element-wise)