Pandas: Removing duplicate rows from a DataFrame (multiple ways)

Updated: February 20, 2024 By: Guest Contributor

Overview

Pandas is a powerful data manipulation tool that can handle large datasets with ease. One common task in data preprocessing is removing duplicate rows from a DataFrame. Duplicates can skew your statistics, produce misleading results, and generally get in the way of clean, interpretable data. This tutorial walks you through various methods of identifying and removing duplicate rows with Pandas, from basic to advanced use cases.

Before we dive into the examples, make sure that you have Pandas installed and imported in your environment:

import pandas as pd
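
If you are not sure whether Pandas is available in your environment, a quick sanity check looks like this (the version shown is only an example):

# Install if needed, from your shell: pip install pandas
import pandas as pd

print(pd.__version__)  # e.g. '2.2.0'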

Identifying Duplicate Rows

To manage duplicates effectively, we first need to identify them. The duplicated() method is our primary tool for this task. It returns a Boolean Series marking duplicates as True and unique rows as False.

# Create a sample DataFrame
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'baz'],
    'B': [1, 2, 1, 2],
    'C': ['x', 'y', 'x', 'y']
})

# Identify duplicates
print(df.duplicated())

# Output
# 0    False
# 1    False
# 2     True
# 3    False
# dtype: bool
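
Because duplicated() returns a plain Boolean Series, you can use it as a mask to inspect or count the duplicate rows before deciding what to do with them:

# Show only the rows flagged as duplicates
print(df[df.duplicated()])

# Output
#      A  B  C
# 2  foo  1  x

# Count the duplicate rows
print(df.duplicated().sum())  # 1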

Removing Duplicate Rows

Once you’ve identified duplicates, removing them is straightforward with the drop_duplicates() method. By default, this method keeps the first occurrence of the duplicate row and removes subsequent duplicates.

# Remove duplicates and keep the first occurrence
new_df = df.drop_duplicates()
print(new_df)

# Output
#      A  B  C
# 0  foo  1  x
# 1  bar  2  y
# 3  baz  2  y
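
Notice that the original index labels (0, 1, 3) are preserved. If you would rather get a fresh 0-based index, drop_duplicates() accepts ignore_index=True (available since Pandas 1.0):

# Remove duplicates and reset the index in one step
clean_df = df.drop_duplicates(ignore_index=True)
print(clean_df)

# Output
#      A  B  C
# 0  foo  1  x
# 1  bar  2  y
# 2  baz  2  y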

Customizing Duplicate Removal

The drop_duplicates() method is versatile. Through its subset parameter you can choose which columns to consider when identifying duplicates, and through keep you can decide whether to retain the first, last, or none of the duplicate rows.

By Specific Columns

# Removing duplicates based on specific columns
specific_df = df.drop_duplicates(subset=['A', 'C'])
print(specific_df)

# Output
#      A  B  C
# 0  foo  1  x
# 1  bar  2  y
# 3  baz  2  y
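
The subset parameter also accepts a single column label. For instance, deduplicating on column 'A' alone keeps only the first row for each distinct value of 'A':

# Deduplicate on a single column
by_a = df.drop_duplicates(subset='A')
print(by_a)

# Output
#      A  B  C
# 0  foo  1  x
# 1  bar  2  y
# 3  baz  2  y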

Keeping the Last Occurrence

# Instead of the default first, keep the last occurrence
last_df = df.drop_duplicates(keep='last')
print(last_df)

# Output
#      A  B  C
# 1  bar  2  y
# 2  foo  1  x
# 3  baz  2  y

Removing All Duplicates

# To remove all duplicates (no first or last occurrence kept)
no_dup_df = df.drop_duplicates(keep=False)
print(no_dup_df)

# Output
#      A  B  C
# 1  bar  2  y
# 3  baz  2  y
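
keep=False also combines with subset. For example, dropping every row whose 'A' value occurs more than once removes both 'foo' rows, even if their other columns were to differ:

# Keep only rows whose 'A' value is unique
unique_a = df.drop_duplicates(subset=['A'], keep=False)
print(unique_a)

# Output
#      A  B  C
# 1  bar  2  y
# 3  baz  2  y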

Handling Duplicates in a Large Dataset

In large datasets, efficiency is key, but note that drop_duplicates() is effectively hash-based under the hood, so sorting first will not generally make it faster. Where sorting does pay off is in controlling which occurrence survives: if you sort your DataFrame before calling drop_duplicates() on a subset of columns, you decide deterministically which row keep='first' or keep='last' retains.

# Sort the DataFrame first
df_sorted = df.sort_values(by=['A', 'C'])
# Then remove duplicates
sorted_no_dup = df_sorted.drop_duplicates(subset=['A', 'C'], keep='last')
print(sorted_no_dup)

# Output
#      A  B  C
# 1  bar  2  y
# 3  baz  2  y
# 2  foo  1  x
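
A common real-world version of this pattern is keeping the most recent record per key. Here is a small sketch with made-up event data: sorting by a timestamp column first guarantees that keep='last' retains the latest row for each user:

# Hypothetical event log: keep only the latest record per user
events = pd.DataFrame({
    'user': ['u1', 'u2', 'u1'],
    'ts': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
    'value': [10, 20, 30]
})
latest = events.sort_values('ts').drop_duplicates(subset='user', keep='last')
print(latest)

# Output
#   user         ts  value
# 1   u2 2024-01-02     20
# 2   u1 2024-01-03     30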

Advanced: Using GroupBy for Complex Duplicate Scenarios

For more complex scenarios, for example when you want to remove duplicates but keep the row with the highest or lowest value in a specific column, groupby() combined with sort_values() is a powerful approach.

# Example: Keep the row with the highest 'B' value for each 'A'
result_df = (df.groupby('A')
             .apply(lambda x: x.sort_values('B', ascending=False).head(1))
             .reset_index(drop=True))
print(result_df)

# Output
#      A  B  C
# 0  bar  2  y
# 1  baz  2  y
# 2  foo  1  x
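
The same result can usually be achieved without apply(): sort on 'B' and let drop_duplicates() keep the first (highest) row per 'A'. This avoids the per-group Python overhead, and sidesteps the DeprecationWarning that recent Pandas versions may emit when apply() operates on the grouping column:

# Equivalent approach: sort, then deduplicate on the group key
alt_df = df.sort_values('B', ascending=False).drop_duplicates(subset='A')
print(alt_df.sort_values('A').reset_index(drop=True))

# Output
#      A  B  C
# 0  bar  2  y
# 1  baz  2  y
# 2  foo  1  x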

Conclusion

Removing duplicate rows from a DataFrame is a crucial step in data preprocessing, ensuring the integrity and reliability of your analysis. Pandas offers flexible, powerful tools to identify, customize, and remove these duplicates efficiently. With this knowledge, you can clean your data effectively, laying a solid foundation for insightful data exploration and analysis.