Understanding DataFrame.join() method in Pandas (5 examples)

Updated: February 20, 2024 By: Guest Contributor Post a comment

Overview

The join() method in Pandandas handles merging datasets horizontally by aligning rows based on their indices. It’s a critical tool for data manipulation and is versatile enough to cater to a variety of data join scenarios. This tutorial aims to elucidate the join() method through five examples of increasing complexity.

Prerequisites: To follow this tutorial, you should have a basic understanding of Python and pandas. Ensure pandas is installed in your Python environment: pip install pandas.

Example #1 – Basic Join Operation

Scenario: You have two DataFrames, df1 and df2, containing user data. You wish to combine them horizontally based on their indices.

import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [28, 34, 45]})
df2 = pd.DataFrame(
    {"Email": ["[email protected]", "[email protected]", "[email protected]"]},
    index=[0, 1, 2],
)

result = df1.join(df2)
print(result)

Output:

   Name  Age                  Email
0  John   28     [email protected]
1  Anna   34     [email protected]
2  Mike   45  [email protected]

Example #2 – Joining with Different Indexes

Scenario: Now, let’s assume df2 has a different index alignment. You still want to join df1 and df2 based on df1‘s index.

import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [28, 34, 45]})
df2 = pd.DataFrame({"Email": ["[email protected]", "[email protected]"]}, index=[1, 2])
result = df1.join(df2, how="left")
print(result)

Output:

   Name  Age                Email
0  John   28                  NaN
1  Anna   34     [email protected]
2  Mike   45  [email protected]

Example #3 – Joining with Overlapping Columns

Scenario: What happens when the DataFrames to be joined have overlapping column names? The join() method provides the lsuffix and rsuffix parameters to handle this.

import pandas as pd

df1 = pd.DataFrame(
    {
        "Name": ["John", "Anna", "Mike"],
        "Email": ["[email protected]", "[email protected]", "[email protected]"],
    }
)
df2 = pd.DataFrame(
    {
        "Email": ["[email protected]", "[email protected]", "[email protected]"],
        "Location": ["NY", "LA", "Chicago"],
    },
    index=[0, 1, 2],
)
result = df1.join(df2, lsuffix="_left", rsuffix="_right")
print(result)

Output:

   Name            Email_left           Email_right  Location
0  John    [email protected]     [email protected]        NY
1  Anna    [email protected]     [email protected]      LA
2  Mike    [email protected]  [email protected]  Chicago

Example #4 – Using the on parameter

Scenario: Sometimes, you might want to join on a specific column rather than the index. In this case, you can use the on parameter. Note that the column specified in on must be present in both DataFrames.

import pandas as pd

df1 = pd.DataFrame(
    {
        "Name": ["John", "Anna", "Mike"],
        "Email": ["[email protected]", "[email protected]", "[email protected]"],
    }
)
df2 = pd.DataFrame(
    {
        "Name": ["John", "Anna", "Mike"],
        "Location": ["NY", "LA", "Chicago"],
    },
    index=[0, 1, 2],
)

df1.set_index("Name", inplace=True)
df2.set_index("Name", inplace=True)
result = df1.join(df2, on="Name")
print(result)

Output:

                   Email Location
Name                             
John  [email protected]       NY
Anna   [email protected]       LA
Mike    [email protected]  Chicago

Example #5 – Complex Join with Multiple DataFrames

Scenario: For our final example, let’s combine multiple join operations into a single line of code, showcasing the versatility of the join() method.

import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [28, 34, 45]})
df2 = pd.DataFrame(
    {
        "Email": ["[email protected]", "[email protected]", "[email protected]"],
        "Location": ["NY", "LA", "Chicago"],
    },
    index=[0, 1, 2],
)

df3 = pd.DataFrame({"Salary": [70000, 120000, 90000]}, index=["John", "Anna", "Mike"])

result = df1.join([df2, df3])
print(result)

Output:

   Name   Age                   Email Location  Salary
0  John  28.0        [email protected]       NY     NaN
1  Anna  34.0     [email protected]       LA     NaN
2  Mike  45.0  [email protected]  Chicago     NaN

Conclusion

The join() method in pandas is a powerful function for horizontally combining DataFrames. As we’ve explored through five examples, it adapts to various data alignment and merging scenarios, making your data manipulation tasks more efficient and streamlined.