Understanding Pandas cut() function (5 examples)

Updated: February 21, 2024 By: Guest Contributor

Table Of Contents

1 Introduction

2 What Does “Binning” Mean?

3 Example 1: Basic Binning

4 Example 2: Labeling Bins

5 Example 3: Dynamic Bin Ranges with qcut

6 Example 4: Handling Outliers with cut

7 Example 5: Custom Functions and cut

8 Conclusion

Introduction

The Pandas cut() function is a powerful tool for binning data, or converting a continuous variable into categorical bins. This tutorial will guide you through understanding and applying the cut() function with five practical examples, ranging from basic to advanced.

What Does “Binning” Mean?

Before diving into the examples, it’s essential to understand what binning means and why it might be beneficial. Binning, or bucketing, is the process of transforming continuous data into categories. This technique helps in data analysis, especially when dealing with continuous variables where you want to segment the population into specific groups.

Pandas’ cut() function offers a straightforward way to perform this operation. You can pass explicit bin edges, or pass an integer to request that many equal-width bins, and you can optionally attach labels to the resulting bins.
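As a minimal sketch of the equal-width case: when bins is an integer, cut() splits the observed range of the data into that many bins of equal width. The values array here is made up for illustration, and retbins=True is used only so the computed edges can be inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values chosen so the range (0 to 90) splits evenly
values = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

# bins=3 requests three equal-width bins; retbins=True also returns the edges
categories, edges = pd.cut(values, bins=3, retbins=True)

print(categories.categories)  # the three intervals pandas created
print(edges)  # four edges; the lowest sits slightly below the data minimum
```

The lowest edge is nudged a little below the minimum so that the smallest value still falls inside the first right-closed bin.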

Example 1: Basic Binning

Let’s start with a simple example of dividing the age of individuals into categories.

import pandas as pd
import numpy as np

# Sample data
ages = np.array([25, 22, 35, 45, 55, 63, 36, 41])

# Using cut to bin data
age_categories = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70], right=True)
print(age_categories)

This will output categories for each age, showing which bin they fall into, such as (20, 30] for ages 25 and 22:

[(20, 30], (20, 30], (30, 40], (40, 50], (50, 60], (60, 70], (30, 40], (40, 50]]
Categories (5, interval[int64, right]): [(20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70]]
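One detail worth checking is which side of each bin is closed. The sketch below uses a single boundary value, picked purely for illustration, to compare the default right=True with right=False.

```python
import pandas as pd

bins = [20, 30, 40]

# Default: intervals are closed on the right, so 30 belongs to (20, 30]
right_closed = pd.cut([30], bins=bins, right=True)

# right=False closes the left side instead, so 30 belongs to [30, 40)
left_closed = pd.cut([30], bins=bins, right=False)

print(right_closed[0])
print(left_closed[0])
```

If your categories are defined as "30 and over", right=False is probably the closer match.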

Example 2: Labeling Bins

It’s often helpful to label the bins for easier interpretation. Here’s how to add labels:

import pandas as pd
import numpy as np

# Sample data
ages = np.array([25, 22, 35, 45, 55, 63, 36, 41])

# Labeling the bins
age_categories = pd.cut(
    ages, bins=[20, 30, 40, 50, 60, 70], labels=["20s", "30s", "40s", "50s", "60s"]
)
print(age_categories)

Now the output assigns a meaningful label to each category, such as ‘20s’ for the first age bin:

['20s', '20s', '30s', '40s', '50s', '60s', '30s', '40s']
Categories (5, object): ['20s' < '30s' < '40s' < '50s' < '60s']
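Labeled bins become most useful once you summarize them. As a rough sketch, assuming you want a count per bin, the Categorical can be wrapped in a Series and counted:

```python
import numpy as np
import pandas as pd

ages = np.array([25, 22, 35, 45, 55, 63, 36, 41])
labels = ["20s", "30s", "40s", "50s", "60s"]
age_categories = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70], labels=labels)

# Wrap the Categorical in a Series to count the members of each bin;
# sort_index() orders the result by bin rather than by frequency
counts = pd.Series(age_categories).value_counts().sort_index()
print(counts)
```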

Example 3: Dynamic Bin Ranges with qcut

Moving beyond static bin ranges, Pandas provides the qcut() function, which is similar to cut() but determines bin edges based on quantiles. This can ensure a more equal distribution of data points across bins.

import pandas as pd
import numpy as np

# Sample data: 100 random incomes between 0 and 100,000
# (unseeded, so the values differ on every run)
income = np.random.rand(100) * 100000

# Using qcut to create quartile-based bins
income_categories = pd.qcut(income, q=4)
print(income_categories)

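qcut() places the bin edges at the quartiles of the data, so each of the four bins holds roughly the same number of values. Because the input is random, the edges you see will differ from the sample below.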

Output:

[(56678.937, 80537.709], (24509.935, 56678.937], (80537.709, 98962.078], (24509.935, 56678.937], (80537.709, 98962.078], ..., (1613.813, 24509.935], (24509.935, 56678.937], (24509.935, 56678.937], (56678.937, 80537.709], (80537.709, 98962.078]]
Length: 100
Categories (4, interval[float64, right]): [(1613.813, 24509.935] < (24509.935, 56678.937] <
                                           (56678.937, 80537.709] < (80537.709, 98962.078]]
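To make the difference from cut() concrete, here is a small sketch that bins the same data both ways and compares the counts per bin. The seeded generator and the seed value are assumptions added for reproducibility; they are not part of the example above.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)
income = pd.Series(rng.random(100) * 100000)

# Equal-width bins: counts depend on how the data happens to be spread
width_counts = pd.cut(income, bins=4).value_counts().sort_index()

# Quantile bins: edges are placed so each bin holds a quarter of the data
quantile_counts = pd.qcut(income, q=4).value_counts().sort_index()

print(width_counts)
print(quantile_counts)
```

With 100 distinct values, each quartile bin should contain 25 of them, while the equal-width counts vary with the data.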

Example 4: Handling Outliers with cut

Sometimes, our dataset might have outliers that we want to handle explicitly or exclude from bins. Here’s one way to do it:

import pandas as pd
import numpy as np

# Sample data
temperature = [22, 25, 15, 30, 33, 40, 45, 50, 60, 10, 5]

# Open-ended outer bins (-inf and +inf) so outliers still get a category
temperature_categories = pd.cut(
    temperature,
    bins=[-np.inf, 15, 25, 35, np.inf],
    labels=["Very Cold", "Cold", "Warm", "Very Warm"],
)
print(temperature_categories)

Because the outer edges are -inf and +inf, every temperature lands in a bin: values of 15 or below become ‘Very Cold’ and values above 35 become ‘Very Warm’.

Output:

['Cold', 'Cold', 'Very Cold', 'Warm', 'Warm', ..., 'Very Warm', 'Very Warm', 'Very Warm', 'Very Cold', 'Very Cold']
Length: 11
Categories (4, object): ['Very Cold' < 'Cold' < 'Warm' < 'Very Warm']
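For contrast, here is a short sketch of what happens when the edges are finite: values outside the outermost edges are not assigned to any bin and come back as NaN. The three temperatures are illustrative values only.

```python
import pandas as pd

temperature = [5, 20, 60]

# Finite edges only: 5 and 60 fall outside (15, 35] and become NaN
bounded = pd.cut(temperature, bins=[15, 25, 35])

print(bounded)
print(pd.isna(bounded))  # True marks the values that were left unbinned
```

This is one way to exclude outliers rather than label them: bin with finite edges, then drop or inspect the NaN entries.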

Example 5: Custom Functions and cut

For more complex binning scenarios, you might want to wrap cut() in your own function. The example below is a thin wrapper that marks the place where custom logic would go:

import pandas as pd


# Custom binning function
def custom_binning(data_series, bins, labels=None):
    # Placeholder for custom logic (e.g., cleaning or validating the input)
    return pd.cut(data_series, bins, labels=labels)


# Sample data
scores = [85, 92, 88, 70, 65, 80, 95]

# Applying the custom function
custom_bins = custom_binning(
    scores, bins=[50, 70, 85, 100], labels=["Average", "Good", "Excellent"]
)
print(custom_bins)

The wrapper gives you a single place to add preprocessing or validation before the values are binned, so the categorization logic can be tailored to your data.

Output:

['Good', 'Excellent', 'Excellent', 'Average', 'Average', 'Good', 'Excellent']
Categories (3, object): ['Average' < 'Good' < 'Excellent']
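As one possible illustration of custom logic, the sketch below clips out-of-range scores to the outermost edges before binning, so nothing ends up as NaN. The helper name clip_and_bin and the sample scores are hypothetical, and clipping is only one of several reasonable policies.

```python
import pandas as pd


def clip_and_bin(values, bins, labels=None):
    """Hypothetical helper: clip values into the bin range, then bin them."""
    series = pd.Series(values)
    # Pull out-of-range values back to the outermost edges
    clipped = series.clip(lower=bins[0], upper=bins[-1])
    # include_lowest=True makes the first bin include its left edge
    return pd.cut(clipped, bins=bins, labels=labels, include_lowest=True)


scores = [45, 70, 105, 88]
result = clip_and_bin(
    scores, bins=[50, 70, 85, 100], labels=["Average", "Good", "Excellent"]
)
print(result.tolist())
```

Here 45 is raised to 50 and 105 is lowered to 100 before binning, so both receive a label instead of being dropped.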

Conclusion

The Pandas cut() function is a versatile tool for segmenting and analyzing continuous data. Through the examples provided, we’ve seen how to apply it in various scenarios, from basic binning to integrating custom logic. Mastering this function can significantly enhance your data processing and analysis capabilities.
