Pandas: How to remove duplicate values from a Series

Updated: February 22, 2024 By: Guest Contributor

Introduction

Pandas is a crucial library in the Python ecosystem, widely used for data manipulation and analysis. For data scientists and analysts in particular, mastering Pandas is a key step in data wrangling and preprocessing. One task you’ll find yourself doing quite frequently is removing duplicate values from a Series. This tutorial takes you from the basics to more advanced techniques, complete with code examples and their outputs.

Preparing a Simple Series

Before diving into removing duplicates, it’s important to understand what a Pandas Series is. A Series is a one-dimensional labeled array capable of holding any data type. Let’s start with creating a simple Series.

import pandas as pd
data = [1, 2, 2, 3, 4, 4, 5]
series = pd.Series(data)
print(series)

Output:

0    1
1    2
2    2
3    3
4    4
5    4
6    5
dtype: int64
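
Before removing anything, it is often handy to check what the distinct values are. Two related helpers worth knowing are unique(), which returns the distinct values as a NumPy array (not a Series), and nunique(), which returns how many there are:

print(series.unique())   # [1 2 3 4 5]
print(series.nunique())  # 5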

Removing Duplicates Using drop_duplicates()

The most straightforward way to remove duplicates from a Series is the drop_duplicates() method. It returns a new Series with the duplicate values removed, leaving the original Series unchanged.

series_unique = series.drop_duplicates()
print(series_unique)

Output:

0    1
1    2
3    3
4    4
6    5
dtype: int64
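
Notice that drop_duplicates() keeps the original index labels. If you would rather have a consecutive 0-based index afterwards, one option is to chain reset_index(drop=True):

series_unique = series.drop_duplicates().reset_index(drop=True)
print(series_unique)

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64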

Keeping the First or Last Occurrence

You can decide whether to keep the first or last occurrence of each duplicated value by using the keep parameter.

Keep the first occurrence (this is the default behavior):

first_occurrence = series.drop_duplicates(keep='first')
print(first_occurrence)

Output:

0    1
1    2
3    3
4    4
6    5
dtype: int64

Keep the last occurrence:

last_occurrence = series.drop_duplicates(keep='last')
print(last_occurrence)

Output:

0    1
2    2
3    3
5    4
6    5
dtype: int64

Removing All Occurrences of Duplicates

If your goal is to eliminate every value that occurs more than once, without keeping any of its occurrences, pass keep=False to drop_duplicates(). Only values that appear exactly once remain.

no_duplicates = series.drop_duplicates(keep=False)
print(no_duplicates)

Output:

0    1
3    3
6    5
dtype: int64
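
drop_duplicates() is closely related to the duplicated() method, which returns a boolean mask marking duplicated entries. Filtering with the negated mask gives the same result, which is handy when you want to combine it with other conditions:

mask = ~series.duplicated(keep=False)  # True only for values that occur exactly once
no_duplicates = series[mask]
print(no_duplicates)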

Applying Conditions with loc or iloc

For more advanced scenarios, you might want to remove duplicates only from a subset of the Series. This can be achieved by using loc or iloc to select values by label, by position, or with a boolean mask, and then dropping duplicates from the result.

condition = series > 2
filtered_series = series.loc[condition].drop_duplicates()
print(filtered_series)

Output:

3    3
4    4
6    5
dtype: int64
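
iloc works the same way with positional selection. For illustration, here is a small sketch that deduplicates only the first five elements of the Series:

subset_unique = series.iloc[:5].drop_duplicates()
print(subset_unique)

Output:

0    1
1    2
3    3
4    4
dtype: int64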

Handling Duplicates in a Series with Custom Functions

Sometimes, the built-in Pandas functions might not be flexible enough for your needs. You can apply custom logic for handling duplicates by using the apply() method together with a user-defined function, as in the sketch below, which identifies and removes duplicates based on personalized criteria.
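
A minimal sketch, assuming we want to treat strings that differ only in letter case as duplicates (the Series names and the helper normalize below are purely illustrative): apply() maps every value to a normalized key, and duplicated() on those keys tells us which original values to drop.

import pandas as pd

names = pd.Series(['Alice', 'alice', 'Bob', 'BOB', 'Carol'])

# Custom criterion: values that normalize to the same key count as duplicates
def normalize(value):
    return value.strip().lower()

keys = names.apply(normalize)             # build comparison keys
deduplicated = names[~keys.duplicated()]  # keep the first value for each key
print(deduplicated)

Output:

0    Alice
2      Bob
4    Carol
dtype: object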

Conclusion

Removing duplicates from a Pandas Series is a fundamental task that enhances the clarity and quality of your data. Whether you’re using the basic drop_duplicates() method or diving into more advanced techniques, Pandas gives you the flexibility to handle duplicates in a way that best suits your analysis. Understanding these methods is a small yet significant step toward mastering data preprocessing tasks in Python.