Introduction
When working with large datasets, duplicates can skew your analysis, leading to incorrect conclusions. Pandas, a powerful Python library for data analysis and manipulation, provides intuitive methods to identify and remove duplicate rows. This tutorial explores how to handle duplicates across multiple columns, ranging from basic to advanced techniques, using Pandas.
Understanding Duplicates in Pandas
In Pandas, a duplicate row is a row that is identical to another row in all or a specified subset of columns. While it’s sometimes necessary to have duplicate observations in a dataset, often, duplicates can be a result of data collection methods or data entry errors.
Let’s start with a simple dataset to understand how duplicates can be identified and removed.
Creating a Sample DataFrame
import pandas as pd
# Sample dataset
data = {
    "Name": ["Alex", "Brian", "Charlie", "Alex", "Emma", "Charlie"],
    "Age": [25, 30, 35, 25, 40, 35],
    "City": [
        "New York",
        "Los Angeles",
        "Chicago",
        "New York",
        "San Francisco",
        "Chicago",
    ],
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alex 25 New York
1 Brian 30 Los Angeles
2 Charlie 35 Chicago
3 Alex 25 New York
4 Emma 40 San Francisco
5 Charlie 35 Chicago
This basic DataFrame shows six rows with potential duplicates. How can we identify and remove these duplicates across multiple columns?
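Before removing anything, it helps to see which rows Pandas flags as duplicates. The .duplicated() method returns a boolean Series; with the default keep='first', every occurrence after the first is marked True:

# Flag rows that exactly match an earlier row in all columns
print(df.duplicated())
# Rows 3 and 5 come back True: they repeat rows 0 and 2

# Check duplicates over a subset of columns only
print(df.duplicated(subset=["Name", "City"]).sum())  # 2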
Removing Duplicate Rows
We can use the .drop_duplicates() method provided by Pandas to remove duplicates. By default, this method keeps the first occurrence of a duplicate row and removes subsequent ones. It can also be customized to keep the last occurrence or to drop every occurrence of a duplicated row.
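The keep parameter controls which occurrence survives. As a quick sketch against the sample DataFrame above:

df.drop_duplicates(keep="first")  # default: keep the first occurrence
df.drop_duplicates(keep="last")   # keep the last occurrence instead
df.drop_duplicates(keep=False)    # drop every row that has a duplicate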
Example 1: Removing Duplicate Rows Across All Columns
# Removing all duplicate rows
df.drop_duplicates(inplace=True)
print(df)
Output:
Name Age City
0 Alex 25 New York
1 Brian 30 Los Angeles
2 Charlie 35 Chicago
4 Emma 40 San Francisco
This code removes every row that exactly matches an earlier row in all columns, keeping only the first occurrence. You can also restrict the check to specific columns.
Example 2: Specifying Columns to Check for Duplicates
# Specifying columns to check for duplicates
df.drop_duplicates(subset=['Name', 'City'], inplace=True)
print(df)
Output:
Name Age City
0 Alex 25 New York
1 Brian 30 Los Angeles
2 Charlie 35 Chicago
4 Emma 40 San Francisco
In this example, only the ‘Name’ and ‘City’ columns are considered when identifying duplicates: for each Name/City pair, the first row is kept and any later matches are dropped. (Since the previous step already removed the exact duplicates in place, the output here is unchanged.)
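Combining subset with keep can be handy when, say, the last entry for a key is the most recent one. A small sketch, rebuilding the sample DataFrame from the data dictionary defined earlier:

# Keep the last record for each Name/City pair
df = pd.DataFrame(data)
latest = df.drop_duplicates(subset=["Name", "City"], keep="last")
print(latest)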
Advanced Techniques for Removing Duplicates
For more complex datasets, you may need to employ more sophisticated strategies to manage duplicates.
Using a Custom Function to Identify Duplicates
Sometimes, duplicates are not exact matches. For instance, one entry might have a slight spelling discrepancy but refer to the same entity. In such cases, you can use the .apply() method with a custom function to express more complex conditions for identifying duplicates.
import pandas as pd
# Sample dataset
data = {
    "Name": ["Alex", "Brian", "Charlie", "Alex", "Emma", "Charlie"],
    "Age": [25, 30, 35, 25, 40, 35],
    "City": [
        "New York",
        "Los Angeles",
        "Chicago",
        "New York",
        "San Francisco",
        "Chicago",
    ],
}
df = pd.DataFrame(data)
# Custom is_duplicate function that works on the DataFrame directly
def is_duplicate(row):
    # Filter df for rows with the same 'Name' and 'City' as the current row
    matches = df[(df["Name"] == row["Name"]) & (df["City"] == row["City"])]
    # Return True if more than one match is found, indicating a duplicate
    return len(matches) > 1
# Apply is_duplicate function across DataFrame rows
df["is_dup"] = df.apply(is_duplicate, axis=1)
# Remove rows flagged as duplicates
df_no_duplicates = df[~df["is_dup"]].drop(columns=["is_dup"])
print(df_no_duplicates)
Output:
Name Age City
1 Brian 30 Los Angeles
4 Emma 40 San Francisco
In this example, the is_duplicate function contains the logic for identifying duplicates; you can adjust it based on your specific criteria, returning True for duplicates and False for non-duplicates. Rows flagged as duplicates are then removed from the DataFrame. Note that this flags every occurrence of a repeated Name/City pair, so all copies are dropped, which is why Alex and Charlie disappear from the output entirely.
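For exact matches, this custom approach is equivalent to df.drop_duplicates(subset=["Name", "City"], keep=False). Where a custom pass really pays off is with near-duplicates. The sketch below normalizes case and whitespace into temporary helper columns before dropping, so entries like "new york " and "New York" compare as equal; the *_key columns here are illustrative helpers, not a built-in Pandas feature:

# Rebuild the sample data, then normalize text columns into helper keys
df = pd.DataFrame(data)
df["name_key"] = df["Name"].str.strip().str.lower()
df["city_key"] = df["City"].str.strip().str.lower()
deduped = df.drop_duplicates(subset=["name_key", "city_key"]).drop(
    columns=["name_key", "city_key"]
)
print(deduped)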
Conclusion
Pandas offers flexible, efficient ways to remove duplicates from your datasets, ensuring the integrity of your data analysis. By leveraging the .drop_duplicates() method and tailoring its use with subset specification or custom identification functions, you can effectively manage and clean your datasets. Removing duplicates is a critical step in preparing your data for insightful analysis.