Pandas: How to ‘FULL JOIN’ 2 DataFrames (3 examples)

Overview
Example 1: Basic Full Join
Example 2: Adding Missing Keys
Example 3: Advanced Full Join with Multiple DataFrames
Conclusion

Overview

In this tutorial, we will explore how to perform a full join, often referred to as a full outer join, between two DataFrames using Pandas in Python. Performing a full join means combining the rows from two tables where there is a match on a common column(s), and including all rows from both DataFrames where there is no match. The result is a DataFrame that has all records from both tables, filled with NaNs (a standard for missing values in Pandas) where there’s no match.

Before diving in, ensure you have the latest version of Pandas installed in your Python environment. If not, you can install or upgrade Pandas using pip:

pip install pandas --upgrade

Example 1: Basic Full Join

Let’s start with the basics. Imagine you have two simple DataFrames-players from two teams, and you want to join them to see a complete roster of players.

import pandas as pd

# Create the first DataFrame
team_a = {'PlayerID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']}
df_team_a = pd.DataFrame(team_a)

# Create the second DataFrame
team_b = {'PlayerID': [4, 5, 3], 'Name': ['Diana', 'Evan', 'Charlie']}
df_team_b = pd.DataFrame(team_b)

# Perform a full join
df_full_join = pd.merge(df_team_a, df_team_b, on='PlayerID', how='outer', suffixes=('_a', '_b'))
print(df_full_join)

The output shows all the players from both teams, including a match on ‘Charlie’ who appears in both teams:

   PlayerID   Name_a   Name_b
c 0        1    Alice      NaN
c 1        2      Bob      NaN
c 2        3  Charlie  Charlie
c 3        4      NaN    Diana
c 4        5      NaN     Evan

Example 2: Adding Missing Keys

Next, consider a scenario where you have additional columns you want to match on, but those columns might have NaN values. Here’s how to handle missing keys in your join operation:

import pandas as pd

# Assuming df_team_a and df_team_b from Example 1

# Adding a new column 'Position' with potential missing values
df_team_a['Position'] = ['Forward', 'Midfield', None]
df_team_b['Position'] = [None, 'Defender', 'Forward']

# Performing a full join allowing for missing 'Position' keys
df_full_join_with_missing_keys = pd.merge(df_team_a, df_team_b, on=['PlayerID', 'Position'], how='outer', suffixes=('_a', '_b'))
print(df_full_join_with_missing_keys)

The output:

   PlayerID   Name_a   Name_b   Position
0        1    Alice      NaN   Forward
1        2      Bob      NaN  Midfield
2        3  Charlie  Charlie   Forward
3        4      NaN    Diana      NaN
4        5      NaN     Evan  Defender

Example 3: Advanced Full Join with Multiple DataFrames

Now let’s push the boundaries by joining more than two DataFrames and managing columns with different names but identical meanings using the left_on and right_on parameters. Imagine another team’s DataFrame with players’ data.

import pandas as pd

# Assuming df_team_a and df_team_b from Example 1

# Another DataFrame for team_c
team_c = {'ID': [6, 3, 7], 'Name': ['Frank', 'Charlie', 'Georgia']}
df_team_c = pd.DataFrame(team_c)

# Performing a full join between df_team_a and df_team_b, then with df_team_c
df_full_join_multi = pd.merge(pd.merge(df_team_a, df_team_b, on='PlayerID', how='outer'), df_team_c, left_on='PlayerID', right_on='ID', how='outer', suffixes=('', '_c'))
print(df_full_join_multi)

The output includes all players from the three teams, demonstrating a complex full join operation:

   PlayerID     Name Name_c   ID
0        1    Alice    NaN  NaN
1        2      Bob    NaN  NaN
2        3  Charlie  Charlie  3.0
3        4      NaN  Diana   NaN
4        5      NaN  Evan    NaN
5        6      NaN  Frank   6.0
6        7      NaN  Georgia 7.0

Conclusion

Through these examples, we’ve seen the power and flexibility of Pandas for data merging, especially for performing a full join. Starting with a simple join and then gradually incorporating nuances such as missing keys and joining multiple frames shows Pandas’ utility in data analysis and preprocessing tasks. Feel free to experiment further with your DataFrames to unlock even more insights.

Next Article: Pandas: How to ‘CROSS JOIN’ 2 DataFrames (5 examples)

Previous Article: Pandas: How to ‘RIGHT JOIN’ 2 DataFrames (with examples)

Series: DateFrames in Pandas

Pandas

How to Use Pandas for Geospatial Data Analysis (3 examples)

February 28, 2024