Pandas: How to merge 2 DataFrames

Updated: February 20, 2024 By: Guest Contributor Post a comment

Introduction

In the world of data analysis and data science, Pandas is a cornerstone library that provides highly efficient and easy-to-use data structures. Understanding how to combine or merge datasets is fundamental as it allows for more comprehensive analysis. This tutorial will guide you through merging two DataFrames with Pandas, from basic to advanced techniques, accompanied by detailed code examples and outputs.

Let’s start with the basics. Panda’s merge() function is a powerful method for joining two DataFrames. The concept is similar to SQL joins, where you can specify how you want to merge the DataFrames – whether to use an inner, outer, left, or right join. The choice of the method depends on your specific needs and the nature of your datasets.

Basic Merge

To start, let’s consider we have two DataFrames, df1 and df2, which we want to merge. Here is an example of how a simple merge is performed.

import pandas as pd

# Sample DataFrame df1
df1 = pd.DataFrame({
    'Key': ['A', 'B', 'C', 'D'],
    'Value_df1': [1, 2, 3, 4]
})

# Sample DataFrame df2
df2 = pd.DataFrame({
    'Key': ['C', 'D', 'E', 'F'],
    'Value_df2': [3, 4, 5, 6]
})

# Merging df1 and df2 on 'Key' column
merged_df = pd.merge(df1, df2, on='Key')

print(merged_df)

This code will produce the following output:

  Key  Value_df1  Value_df2
0   C          3          3
1   D          4          4

In this example, we performed an inner join, which means the merged DataFrame contains only the rows that have common keys in both df1 and df2.

Note that we’ll continue using df1 and df2 in the coming examples.

Specifying the Merge Type

Now, let’s explore how to specify different types of merges. Panda’s merge function allows you to use the how parameter to define the join type. Here’s how to apply each one.

Outer Join

merged_df = pd.merge(df1, df2, on='Key', how='outer') 
print(merged_df) 

This code snippet will produce:

  Key  Value_df1  Value_df2
0   A        1.0        NaN
1   B        2.0        NaN
2   C        3.0        3.0
3   D        4.0        4.0
4   E        NaN        5.0
5   F        NaN        6.0

An outer join returns all rows from both DataFrames, with NaN in place of missing values.

Left and Right Joins

Applying a left join means that all rows from the left DataFrame (df1 in our example) are included in the output, along with any common rows from the right DataFrame (df2). A right join is just the opposite. Here’s how to apply these joins:

# Left join 
left_merge_df = pd.merge(df1, df2, on='Key', how='left') 
print(left_merge_df) 

# Right join 
right_merge_df = pd.merge(df1, df2, how='right', on='Key') 
print(right_merge_df) 

The left join output:

  Key  Value_df1  Value_df2
0   A          1        NaN
1   B          2        NaN
2   C          3        3.0
3   D          4        4.0

The right join output:

  Key  Value_df1  Value_df2
0   C        3.0          3
1   D        4.0          4
2   E        NaN          5
3   F        NaN          6

Advanced Merges

As we move to more advanced examples, let’s discuss merging DataFrames with different key column names, using the left_on and right_on parameters.

import pandas as pd

# Sample DataFrame df1
df1 = pd.DataFrame({
    'A_Key': ['A', 'B', 'C', 'D'],
    'Value_df1': [1, 2, 3, 4]
})

# Sample DataFrame df2
df2 = pd.DataFrame({
    'B_Key': ['C', 'D', 'E', 'F'],
    'Value_df2': [3, 4, 5, 6]
})

# Merge with different key columns
merged_difference = pd.merge(df1, df2, left_on='A_Key', right_on='B_Key')

print(merged_difference)

This results in:

  A_Key  Value_df1 B_Key  Value_df2
0     C          3     C          3
1     D          4     D          4

This technique is particularly useful when merging DataFrames that do not have the same column names but represent the same entity.

Conclusion

Merging two DataFrames is a powerful way to combine data for enhanced analysis and insights. Whether you’re applying a basic inner join or delving into more complex merges with different key columns, Pandas provides the flexibility and robustness required for effective data manipulation. Remember, the key to a successful merge lies in understanding the nature of your data and selecting the merge type that best fits your analysis needs.