Pandas: How to ‘LEFT JOIN’ 2 DataFrames

Overview
1. Introduction to Pandas ‘LEFT JOIN’
Basic Example
Handling Non-Matching Columns
Multiple Keys Join
Adding Suffixes to Distinguish Columns
Conclusion

Overview

Performing a ‘LEFT JOIN’ between two DataFrames is a common task in data analysis and manipulation, allowing you to merge data from two separate tables based on a common key. This tutorial aims to guide you through the process, from basic to advanced examples, utilizing the powerful Pandas library in Python.

Introduction to Pandas ‘LEFT JOIN’

In Pandas, a ‘LEFT JOIN’ operation is accomplished using the merge function. This function combines rows from two DataFrames based on a common key, keeping all rows from the left DataFrame and only matching rows from the right DataFrame. Columns from both DataFrames that do not link directly are filled with NaN values for missing matches.

Basic Example

import pandas as pd

df_left = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value_left': [1, 2, 3, 4]
})

df_right = pd.DataFrame({
    'key': ['B', 'C', 'D', 'E'],
    'value_right': [5, 6, 7, 8]
})

result = pd.merge(df_left, df_right, on='key', how='left')
print(result)

This simple example illustrates how to perform a ‘LEFT JOIN’ between df_left and df_right. The resulting DataFrame will include all rows from df_left and the matching rows from df_right. Columns from df_right without matching keys will show NaN values for those entries.

Handling Non-Matching Columns

Sometimes, DataFrames have non-matching column names that you wish to join on. Pandas allows you to specify the left and right keys separately using the left_on and right_on parameters.

df_left = pd.DataFrame({
    'key_left': ['A', 'B', 'C', 'D'],
    'value': [1, 2, 3, 4]
})

df_right = pd.DataFrame({
    'key_right': ['B', 'C', 'D', 'E'],
    'value': [5, 6, 7, 8]
})

result = pd.merge(df_left, df_right, left_on='key_left', right_on='key_right', how='left')
print(result)

This approach keeps all rows from df_left and matches with df_right where possible. Non-matching columns yield NaN values for the unmatched rows.

Multiple Keys Join

In more complex scenarios, you might need to join DataFrames based on multiple keys. This requires you to pass a list of column names to the on parameter.

df_left = pd.DataFrame({
    'key1': ['A', 'B'],
    'key2': ['C', 'D'],
    'value_left': [1, 2]
})

df_right = pd.DataFrame({
    'key1': ['B', 'C'],
    'key2': ['D', 'E'],
    'value_right': [3, 4]
})

result = pd.merge(df_left, df_right, on=['key1', 'key2'], how='left')
print(result)

This allows for more granular control over the join conditions, ensuring that only rows with matching values in all specified keys are combined.

Adding Suffixes to Distinguish Columns

When both DataFrames contain columns with the same name but unrelated content, you can use the suffixes argument to add suffixes to the overlapping column names, making them unique.

df_left = pd.DataFrame({
    'key': ['A', 'B'],
    'value': [1, 2]
})

df_right = pd.DataFrame({
    'key': ['B', 'C'],
    'value': [3, 4]
})

result = pd.merge(df_left, df_right, on='key', how='left', suffixes=('_left', '_right'))
print(result)

This technique prevents potential column name conflicts, ensuring clarity in the resulting DataFrame.

In addition to the basic ‘LEFT JOIN’, Pandas offers advanced features for complex merging scenarios, such as specifying the join type with the indicator parameter which shows from which DataFrame each row comes from, or handling overlapping indexes with left_index and right_index parameters.

Conclusion

Merging DataFrames using ‘LEFT JOIN’ in Pandas is a versatile technique for combining datasets based on shared keys. This tutorial has explored various scenarios to equip you with the knowledge to perform these operations with confidence. Understanding how to effectively use ‘LEFT JOIN’ will significantly enhance your data manipulation and analysis capabilities in Python.

Next Article: Pandas: How to ‘RIGHT JOIN’ 2 DataFrames (with examples)

Previous Article: Pandas: How to ‘INNER JOIN’ 2 DataFrames

Series: DateFrames in Pandas

Pandas

How to Use Pandas for Geospatial Data Analysis (3 examples)

February 28, 2024