Overview
The join()
method in Pandandas handles merging datasets horizontally by aligning rows based on their indices. It’s a critical tool for data manipulation and is versatile enough to cater to a variety of data join scenarios. This tutorial aims to elucidate the join()
method through five examples of increasing complexity.
Prerequisites: To follow this tutorial, you should have a basic understanding of Python and pandas. Ensure pandas is installed in your Python environment: pip install pandas
.
Example #1 – Basic Join Operation
Scenario: You have two DataFrames, df1
and df2
, containing user data. You wish to combine them horizontally based on their indices.
import pandas as pd
df1 = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [28, 34, 45]})
df2 = pd.DataFrame(
{"Email": ["[email protected]", "[email protected]", "[email protected]"]},
index=[0, 1, 2],
)
result = df1.join(df2)
print(result)
Output:
Name Age Email
0 John 28 [email protected]
1 Anna 34 [email protected]
2 Mike 45 [email protected]
Example #2 – Joining with Different Indexes
Scenario: Now, let’s assume df2
has a different index alignment. You still want to join df1
and df2
based on df1
‘s index.
import pandas as pd
df1 = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [28, 34, 45]})
df2 = pd.DataFrame({"Email": ["[email protected]", "[email protected]"]}, index=[1, 2])
result = df1.join(df2, how="left")
print(result)
Output:
Name Age Email
0 John 28 NaN
1 Anna 34 [email protected]
2 Mike 45 [email protected]
Example #3 – Joining with Overlapping Columns
Scenario: What happens when the DataFrames to be joined have overlapping column names? The join()
method provides the lsuffix
and rsuffix
parameters to handle this.
import pandas as pd
df1 = pd.DataFrame(
{
"Name": ["John", "Anna", "Mike"],
"Email": ["[email protected]", "[email protected]", "[email protected]"],
}
)
df2 = pd.DataFrame(
{
"Email": ["[email protected]", "[email protected]", "[email protected]"],
"Location": ["NY", "LA", "Chicago"],
},
index=[0, 1, 2],
)
result = df1.join(df2, lsuffix="_left", rsuffix="_right")
print(result)
Output:
Name Email_left Email_right Location
0 John [email protected] [email protected] NY
1 Anna [email protected] [email protected] LA
2 Mike [email protected] [email protected] Chicago
Example #4 – Using the on
parameter
Scenario: Sometimes, you might want to join on a specific column rather than the index. In this case, you can use the on
parameter. Note that the column specified in on
must be present in both DataFrames.
import pandas as pd
df1 = pd.DataFrame(
{
"Name": ["John", "Anna", "Mike"],
"Email": ["[email protected]", "[email protected]", "[email protected]"],
}
)
df2 = pd.DataFrame(
{
"Name": ["John", "Anna", "Mike"],
"Location": ["NY", "LA", "Chicago"],
},
index=[0, 1, 2],
)
df1.set_index("Name", inplace=True)
df2.set_index("Name", inplace=True)
result = df1.join(df2, on="Name")
print(result)
Output:
Email Location
Name
John [email protected] NY
Anna [email protected] LA
Mike [email protected] Chicago
Example #5 – Complex Join with Multiple DataFrames
Scenario: For our final example, let’s combine multiple join operations into a single line of code, showcasing the versatility of the join()
method.
import pandas as pd
df1 = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [28, 34, 45]})
df2 = pd.DataFrame(
{
"Email": ["[email protected]", "[email protected]", "[email protected]"],
"Location": ["NY", "LA", "Chicago"],
},
index=[0, 1, 2],
)
df3 = pd.DataFrame({"Salary": [70000, 120000, 90000]}, index=["John", "Anna", "Mike"])
result = df1.join([df2, df3])
print(result)
Output:
Name Age Email Location Salary
0 John 28.0 [email protected] NY NaN
1 Anna 34.0 [email protected] LA NaN
2 Mike 45.0 [email protected] Chicago NaN
Conclusion
The join()
method in pandas is a powerful function for horizontally combining DataFrames. As we’ve explored through five examples, it adapts to various data alignment and merging scenarios, making your data manipulation tasks more efficient and streamlined.