Pandas DataFrame: Counting distinct values in a column/row

Updated: February 20, 2024 By: Guest Contributor Post a comment

Introduction

Pandas is a powerful Python library for data manipulation and analysis, offering various functions and methods that make it easy to work with structured data. One common task when dealing with data is counting the distinct (or unique) values in a column or row. This information can be essential for data cleaning, preprocessing, or exploring your dataset to understand its characteristics better. In this tutorial, we will explore different ways to count distinct values in a column or row of a Pandas DataFrame.

Getting Started

Before diving into the details, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas

Once installed, import Pandas as follows:

import pandas as pd

Basic Examples

Let’s start with the basics. Suppose we have the following DataFrame (we’ll use this one through out this tutorial):

import pandas as pd

data = {
    "Name": ["John", "Ana", "Peter", "John", "Ana"],
    "Age": [28, 34, 29, 28, 34],
    "City": ["New York", "Paris", "London", "New York", "Paris"],
}
df = pd.DataFrame(data)

To count the distinct values in the ‘Name’ column, you can use the nunique() method:

print(df['Name'].nunique())
# Output: 3

This method returns the number of unique values in the column. Similarly, for the ‘Age’ and ‘City’ columns:

print(df['Age'].nunique())
# Output: 3
print(df['City'].nunique())
# Output: 3

Advanced Examples

Moving on to more advanced usage, let’s explore how to count distinct values across multiple columns or the entire DataFrame. You can use the drop_duplicates() method in conjunction with shape to achieve this:

unique_rows = df.drop_duplicates()
print(unique_rows.shape[0])
# Output: 5

This method removes duplicate rows across all columns and returns a new DataFrame. The shape[0] represents the number of rows, effectively counting the distinct rows.

You can also count distinct values in a row. This is less common, but can be useful in specific scenarios. Assuming you’re interested in the distinct values across a single row, you can use:

print(df.iloc[0].nunique())
# Output: 3

This example uses iloc to select the first row and counts the distinct values within it.

Counting Distinct Values With Conditions

Often, you might want to count distinct values under certain conditions. For instance, counting the distinct names of people over 30 years old. This can be achieved using boolean indexing combined with nunique():

print(df[df['Age'] > 30]['Name'].nunique())
# Output: 2

This filters the DataFrame to only include rows where the ‘Age’ is greater than 30 and then counts the distinct ‘Name’ values in this filtered DataFrame.

Using value_counts() for Detailed Insights

Another helpful method for working with distinct values is value_counts(), which not only tells you the number of distinct values but also how often each occurs. For instance:

name_counts = df['Name'].value_counts()
print(name_counts)
# Output: John 2
#         Ana 2
#         Peter 1

This can be particularly useful for understanding the distribution of values in your dataset.

Conclusion

Counting distinct values in a Pandas DataFrame is a crucial task for data analysis, allowing us to understand the variability and uniqueness of our data. Whether you’re cleaning data, preprocessing, or simply exploring, methods like nunique(), drop_duplicates(), and value_counts() make it easy to get insightful metrics on your data’s unique characteristics. By mastering these tools, you can extract meaningful insights and make informed decisions based on your dataset’s distinct values.