Sling Academy
Home/Pandas/Pandas DataFrame: Counting unique values in each group

Pandas DataFrame: Counting unique values in each group

Last updated: February 21, 2024

Overview

Working with Pandas DataFrames is a fundamental skill for any data scientist or analyst. A common operation when analyzing data is grouping data and calculating statistics on these groups. In this tutorial, we will focus on how to count unique values in each group using a Pandas DataFrame. This operation is useful when you need to understand the diversity or variance within groups in your dataset.

We’ll begin with basic examples and gradually move on to more advanced scenarios. Examples will cover a range of techniques from using groupby and nunique methods to applying more complex functions for deeper insights into your grouped data.

Introduction to Grouping in Pandas

Before we delve into counting unique values, let’s quickly review how to group data in a Pandas DataFrame. Grouping involves one or more keys by which the data is split into groups. Each group can then be aggregated or transformed independently.

import pandas as pd

# Sample DataFrame
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C'],
        'Values': [1, 2, 2, 3, 3, 1, 2, 3]}
df = pd.DataFrame(data)

The simplest way to group data is by using the groupby method. For example, if we want to group by the ‘Category’ column:

grouped_df = df.groupby('Category')

Basic Example: Counting Unique Values

Counting unique values in each group can be achieved using the nunique method. This method returns the count of unique values for each column in each group:

unique_counts = grouped_df['Values'].nunique()
print(unique_counts)

This results in:

Category
A    2
B    2
C    2
Name: Values, dtype: int64

As you can see, each category has two unique values in the ‘Values’ column.

Intermediate Example: Custom Groups and Multiple Columns

Let’s explore how to count unique values in custom groupings and across multiple columns. Suppose we have an additional ‘Subcategory’ column and we want to group by both ‘Category’ and ‘Subcategory’:

data['Subcategory'] = ['X', 'X', 'Y', 'Y', 'X', 'Y', 'X', 'Y']
df = pd.DataFrame(data)
grouped_df = df.groupby(['Category', 'Subcategory'])

To count unique values across multiple columns, use:

unique_counts = grouped_df[['Values', 'Subcategory']].nunique()
print(unique_counts)

This yields:

                     Values  Subcategory
Category Subcategory                  
A        X              1          1
         Y              1          1
B        X              1          1
C        Y              1          1
         X              2          1

Notice how the count of unique ‘Values’ and ‘Subcategory’ changes depending on the group.

Advanced Example: Using agg with Custom Functions

For even more control over how unique values are counted, you can use the agg method with custom functions. This is especially useful when you want to apply different aggregations for different columns or process the aggregated results further.

def count_unique(series):
    return len(series.unique())

unique_counts = grouped_df.agg({
    'Values': [count_unique, 'nunique'],
    'Subcategory': 'nunique'
})
print(unique_counts)

The output will display the result of both the custom count_unique function and the built-in nunique method for comparison.

Grouping and Counting Unique Values with Complex Conditions

For scenarios with complex grouping criteria or where the data needs to be filtered before counting, Pandas offers robust tools to apply conditions and filters within your groupby operations.

custom_group = df.groupby('Category').filter(lambda x: x['Values'].nunique() > 1)
print(custom_group.groupby('Category')['Values'].nunique())

This will filter groups with more than one unique value before counting unique values, offering a focused analysis on groups of interest.

Visualizing the Counts of Unique Values

In addition to numerical analysis, visual representations can provide intuitive insights into the distribution of unique counts across groups. Utilizing libraries like Matplotlib or Seaborn, we can easily plot the counts of unique values for a more comprehensive understanding.

import matplotlib.pyplot as plt
import seaborn as sns

# Plotting unique counts
counts = df.groupby('Category')['Values'].nunique().reset_index()
sns.barplot(x='Category', y='Values', data=counts)
plt.show()

Conclusion

Counting unique values within groups in a Pandas DataFrame is a powerful tool for data analysis, providing insights into the diversity and variance of data subsets. As we’ve explored from basic to advanced examples, there are multiple ways to approach this operation, each offering unique perspectives on the data. With the powerful grouping and aggregation capabilities of Pandas, detailed and complex data analysis tasks can be performed efficiently and effectively.

Next Article: Pandas DataFrame: Grouping rows by day of the week

Previous Article: Pandas DataFrame: Finding min/max value in each group

Series: DateFrames in Pandas

Pandas

You May Also Like

  • How to Use Pandas Profiling for Data Analysis (4 examples)
  • How to Handle Large Datasets with Pandas and Dask (4 examples)
  • Pandas – Using DataFrame.pivot() method (3 examples)
  • Pandas: How to ‘FULL JOIN’ 2 DataFrames (3 examples)
  • Pandas: Select columns whose names start/end with a specific string (4 examples)
  • 3 ways to turn off future warnings in Pandas
  • How to Integrate Pandas with Apache Spark
  • How to Use Pandas for Web Scraping and Saving Data (2 examples)
  • How to Clean and Preprocess Text Data with Pandas (3 examples)
  • Pandas – Using Series.replace() method (3 examples)
  • Pandas json_normalize() function: Explained with examples
  • Pandas: Reading CSV and Excel files from AWS S3 (4 examples)
  • Using pandas.Series.rank() method (4 examples)
  • Pandas: Dropping columns whose names contain a specific string (4 examples)
  • Pandas: How to print a DataFrame without index (3 ways)
  • Fixing Pandas NameError: name ‘df’ is not defined
  • Pandas – Using DataFrame idxmax() and idxmin() methods (4 examples)
  • Pandas FutureWarning: ‘M’ is deprecated and will be removed in a future version, please use ‘ME’ instead
  • Pandas: Checking equality of 2 DataFrames (element-wise)