Pandas: How to remove all duplicate rows across multiple columns

Updated: February 20, 2024 By: Guest Contributor

Introduction

When working with large datasets, duplicates can skew your analysis, leading to incorrect conclusions. Pandas, a powerful Python library for data analysis and manipulation, provides intuitive methods to identify and remove duplicate rows. This tutorial explores how to handle duplicates across multiple columns, ranging from basic to advanced techniques, using Pandas.

Understanding Duplicates in Pandas

In Pandas, a duplicate row is a row that is identical to another row in all or a specified subset of columns. While duplicate observations are sometimes legitimate, they often result from data collection methods or data entry errors.

Let’s start with a simple dataset to understand how duplicates can be identified and removed.

Creating a Sample DataFrame

import pandas as pd

# Sample dataset
data = {
    "Name": ["Alex", "Brian", "Charlie", "Alex", "Emma", "Charlie"],
    "Age": [25, 30, 35, 25, 40, 35],
    "City": [
        "New York",
        "Los Angeles",
        "Chicago",
        "New York",
        "San Francisco",
        "Chicago",
    ],
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age           City
0     Alex   25       New York
1    Brian   30    Los Angeles
2  Charlie   35        Chicago
3     Alex   25       New York
4     Emma   40  San Francisco
5  Charlie   35        Chicago

This basic DataFrame has six rows, two of which (rows 3 and 5) exactly duplicate earlier rows (0 and 2). How can we identify and remove these duplicates across multiple columns?
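Before removing anything, it helps to see which rows Pandas considers duplicates. The .duplicated() method returns a boolean Series that marks each row repeating an earlier one:

# Flag rows that are exact repeats of an earlier row
print(df.duplicated())

Output:

0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool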

Removing Duplicate Rows

We can use the .drop_duplicates() method provided by Pandas to remove duplicates. By default, this method keeps the first occurrence of a duplicate row and removes subsequent ones. It can also be customized to keep the last occurrence or remove all duplicates entirely.
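The keep parameter controls this behavior: keep='first' (the default) retains the first occurrence, keep='last' retains the last, and keep=False drops every row that has a duplicate. Neither call below modifies df, since inplace defaults to False:

# Keep the last occurrence of each duplicated row instead of the first
print(df.drop_duplicates(keep='last'))

# Drop every duplicated row entirely, including the first occurrence
print(df.drop_duplicates(keep=False))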

Example 1: Remove Fully Duplicated Rows

# Removing all duplicate rows
df.drop_duplicates(inplace=True)
print(df)

Output:

      Name  Age           City
0     Alex   25       New York
1    Brian   30    Los Angeles
2  Charlie   35        Chicago
4     Emma   40  San Francisco

This code removes every row that exactly matches an earlier row across all columns, keeping the first occurrence. You can also restrict the check to specific columns.

Example 2: Specifying Columns to Check for Duplicates

# Specifying columns to check for duplicates
df.drop_duplicates(subset=['Name', 'City'], inplace=True)
print(df)

Output:

      Name  Age           City
0     Alex   25       New York
1    Brian   30    Los Angeles
2  Charlie   35        Chicago
4     Emma   40  San Francisco

In this example, only the ‘Name’ and ‘City’ columns are considered when identifying duplicates. Because Example 1 already removed the fully identical rows, the output is unchanged here; run against the original six-row DataFrame, this call would likewise drop rows 3 and 5, since they share both Name and City with earlier rows.
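To remove all rows that share values in those columns, including their first occurrences, combine subset with keep=False. A short sketch, recreating the sample data first since the earlier examples modified df in place:

# Recreate the original six-row DataFrame
df = pd.DataFrame(data)

# Drop every row whose ('Name', 'City') pair appears more than once
print(df.drop_duplicates(subset=['Name', 'City'], keep=False))

Output:

    Name  Age           City
1  Brian   30    Los Angeles
4   Emma   40  San Francisco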

Advanced Techniques for Removing Duplicates

For more complex datasets, you may need to employ more sophisticated strategies to manage duplicates.

Using a Custom Function to Identify Duplicates

Sometimes, duplicates are not exact matches. For instance, one entry might have a slight spelling discrepancy yet refer to the same entity. In such cases, you can use .apply() with a custom function to encode more complex conditions for identifying duplicates.

import pandas as pd

# Sample dataset
data = {
    "Name": ["Alex", "Brian", "Charlie", "Alex", "Emma", "Charlie"],
    "Age": [25, 30, 35, 25, 40, 35],
    "City": [
        "New York",
        "Los Angeles",
        "Chicago",
        "New York",
        "San Francisco",
        "Chicago",
    ],
}
df = pd.DataFrame(data)


# Adjusted is_duplicate function to work with DataFrame directly
def is_duplicate(row):
    # Filter df for rows with the same 'Name' and 'City' as the current row
    matches = df[(df["Name"] == row["Name"]) & (df["City"] == row["City"])]
    # Return True if more than one match found, indicating a duplicate
    return len(matches) > 1


# Apply is_duplicate function across DataFrame rows
df["is_dup"] = df.apply(is_duplicate, axis=1)

# Remove rows flagged as duplicates
df_no_duplicates = df[~df["is_dup"]].drop(columns=["is_dup"])

print(df_no_duplicates)

Output:

    Name  Age           City
1  Brian   30    Los Angeles
4   Emma   40  San Francisco

In this example, the is_duplicate function contains the logic for identifying duplicates. You can adapt it to your own criteria, marking true duplicates with True and everything else with False. Rows flagged as duplicates are then filtered out of the DataFrame.
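For exact matches like this one, the same result can be obtained without a custom function by combining the built-in .duplicated() method with keep=False:

# Built-in equivalent: flag every row whose ('Name', 'City') pair repeats
df = pd.DataFrame(data)  # fresh copy of the sample data
print(df[~df.duplicated(subset=["Name", "City"], keep=False)])

Where duplicates differ only superficially, for example in case or stray whitespace, one common approach is to normalize the key columns before dropping duplicates. A minimal sketch (the name_key and city_key helper columns are illustrative, not part of the original example):

# Normalize case and whitespace, then deduplicate on the normalized keys
df["name_key"] = df["Name"].str.strip().str.lower()
df["city_key"] = df["City"].str.strip().str.lower()
df_clean = df.drop_duplicates(subset=["name_key", "city_key"]).drop(
    columns=["name_key", "city_key"]
)
print(df_clean)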

Conclusion

Pandas offers flexible, efficient ways to remove duplicates from your datasets, helping ensure the integrity of your data analysis. By leveraging the .drop_duplicates() method and tailoring it with the subset and keep parameters, or with custom identification logic, you can effectively manage and clean your datasets. Removing duplicates is a critical step in preparing your data for insightful analysis.