Pandas – DataFrame.duplicated() method (5 examples)

Updated: February 22, 2024 By: Guest Contributor

Introduction

Pandas is a widely used library for data analysis and manipulation. One of its core data-cleaning tools is the DataFrame.duplicated() method, which flags duplicate rows within a DataFrame so that they can be inspected or removed. In this tutorial, we will explore DataFrame.duplicated() through five examples, ranging from basic usage to a more advanced case.

What is the Point?

Before diving into examples, it helps to understand what the DataFrame.duplicated() method does. It returns a Boolean Series, aligned to the DataFrame's index, in which True marks a duplicate row. By default, a row is marked True when it repeats a row that appears earlier in the DataFrame. The method takes two optional parameters:

  • subset: A column label or sequence of labels to compare when identifying duplicates. The default (None) compares all columns.
  • keep: Determines which occurrence in each set of duplicates is left unmarked (False). Options are 'first' (default), 'last', or False (mark every occurrence as True).
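Because the result is a Boolean Series, it can serve directly as a row mask. The following is a minimal sketch on made-up data (the `orders` DataFrame and its column names are hypothetical) showing that inverting the mask selects the same rows as drop_duplicates() with its default arguments:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
orders = pd.DataFrame({
    'customer': ['ann', 'bob', 'bob', 'cy'],
    'item': ['pen', 'ink', 'ink', 'pen'],
})

mask = orders.duplicated()      # True for rows that repeat an earlier row
unique_rows = orders[~mask]     # keep only the rows that were not flagged

# drop_duplicates() with default arguments returns the same rows
same = unique_rows.equals(orders.drop_duplicates())

print(unique_rows)
print(same)
```

The mask form is handy when the flagged rows need to be inspected or logged before anything is dropped.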

Example 1: Basic Usage

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['a', 'b', 'b', 'c', 'd', 'd']
})

print(df.duplicated())

Output:

0    False
1    False
2     True
3    False
4    False
5     True
dtype: bool

With no arguments, duplicated() compares all columns. Rows 2 and 5 repeat rows 1 and 4 respectively, so they are marked True, while the first occurrence of each pair stays False.
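A common follow-up is to count the flagged rows or look at them directly. As a small sketch using the same data as above (the variable names `dup_mask`, `n_duplicates`, and `duplicate_rows` are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['a', 'b', 'b', 'c', 'd', 'd']
})

dup_mask = df.duplicated()

n_duplicates = int(dup_mask.sum())   # each True counts as 1
has_duplicates = bool(dup_mask.any())
duplicate_rows = df[dup_mask]        # the flagged rows themselves

print(n_duplicates)
print(has_duplicates)
print(duplicate_rows)
```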

Example 2: Specifying Columns

print(df.duplicated(subset=['A']))

Output:

0    False
1    False
2     True
3    False
4    False
5     True
dtype: bool

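In this example, the subset parameter restricts the comparison to column 'A'. The output matches Example 1 only because, in this particular data, the values in 'A' and 'B' always repeat together. When other columns differ between rows, restricting the comparison with subset can flag rows that a full-row comparison would not.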
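To see subset change the result, the data needs rows that agree in one column but not in another. The sketch below uses a second, hypothetical DataFrame (`df2`) built for that purpose:

```python
import pandas as pd

# Hypothetical data: 'A' repeats in rows 0 and 1, but 'B' differs between them
df2 = pd.DataFrame({
    'A': [1, 1, 2, 2],
    'B': ['x', 'y', 'z', 'z'],
})

all_cols = df2.duplicated()              # compares both columns
only_a = df2.duplicated(subset=['A'])    # compares column 'A' only

# Show the two results side by side
print(pd.DataFrame({'all_cols': all_cols, 'only_A': only_a}))
```

Row 1 is flagged only when the comparison is limited to 'A', since its 'B' value differs from row 0.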

Example 3: Keeping Last Occurrences

print(df.duplicated(keep='last'))

Output:

0    False
1     True
2    False
3    False
4     True
5    False
dtype: bool

With keep='last', the last occurrence in each set of duplicates is left unmarked and every earlier occurrence is marked True. This is useful when the most recent entry is the one worth keeping.
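One way this might be applied is keeping only the latest record per key. The sketch below assumes a hypothetical `log` DataFrame whose rows are already in chronological order; the column names are illustrative:

```python
import pandas as pd

# Hypothetical log; later rows are assumed to be more recent
log = pd.DataFrame({
    'user': ['u1', 'u2', 'u1', 'u3', 'u2'],
    'status': ['new', 'new', 'active', 'new', 'closed'],
})

# True for every row except the last one seen for each user
older = log.duplicated(subset=['user'], keep='last')
latest = log[~older]

print(latest)
```

If the rows are not already ordered by time, sorting them first would be needed for "last" to mean "most recent".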

Example 4: Marking All Duplicates

print(df.duplicated(keep=False))

Output:

0    False
1     True
2     True
3    False
4     True
5     True
dtype: bool

Setting keep=False marks every row that has at least one duplicate as True, including the first occurrence. This is the option to use when you want to inspect whole groups of duplicates, or to drop every row that is not unique.
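The keep=False mask splits a DataFrame cleanly into rows that are repeated and rows that occur exactly once. A brief sketch on the same data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['a', 'b', 'b', 'c', 'd', 'd']
})

in_dup_group = df.duplicated(keep=False)

repeated = df[in_dup_group]      # every row that has at least one twin
singletons = df[~in_dup_group]   # rows that occur exactly once

print(repeated)
print(singletons)
```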

Example 5: Advanced Use Case

For a more advanced scenario, consider a DataFrame with timestamped data in which rows should count as duplicates when they fall within the same time window. DataFrame.duplicated() has no built-in notion of time windows, so this requires an extra step: rounding the time column to a chosen granularity and then comparing on the rounded values.

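# Assumes df has a 'Timestamp' column (the Example 1 data does not include one)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
# Round to the nearest hour; 'h' is the current alias, 'H' is deprecated since pandas 2.2
df['RoundedTimestamp'] = df['Timestamp'].dt.round('h')
print(df.duplicated(subset=['A', 'RoundedTimestamp']))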

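The call to duplicated() itself is unchanged here; the work lies in pre-processing the data so that rows from the same hour compare as equal. Note that rounding creates fixed hourly buckets rather than a sliding window, so two timestamps only a few minutes apart can still land in different buckets.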
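As a self-contained sketch of this idea, the example below builds a hypothetical `events` DataFrame (the values and the `bucket` column name are made up for illustration) and compares two bucketing choices: dt.round(), as used above, and dt.floor(), which assigns each row to the clock hour it falls in.

```python
import pandas as pd

# Hypothetical readings; values are illustrative only
events = pd.DataFrame({
    'A': [1, 1, 1, 2],
    'Timestamp': pd.to_datetime([
        '2024-01-01 10:05',
        '2024-01-01 10:20',
        '2024-01-01 10:40',
        '2024-01-01 10:10',
    ]),
})

# Nearest hour: 10:40 moves up to 11:00, the other rows go to 10:00
rounded = events.assign(bucket=events['Timestamp'].dt.round('h'))
dup_round = rounded.duplicated(subset=['A', 'bucket'])

# Start of the clock hour: every row goes to 10:00
floored = events.assign(bucket=events['Timestamp'].dt.floor('h'))
dup_floor = floored.duplicated(subset=['A', 'bucket'])

print(pd.DataFrame({'round': dup_round, 'floor': dup_floor}))
```

Row 2 (10:40) is flagged only with floor(), which shows that the bucketing choice changes what counts as a duplicate. Neither variant is a true sliding window.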

Conclusion

The DataFrame.duplicated() method is the standard way to identify duplicate rows in Pandas DataFrames. The examples above covered full-row comparison, column subsets, the three keep options, and a pre-processing step for timestamped data. Used early in a data workflow, duplicated() helps catch repeated records before they distort counts, aggregates, or joins.