Introduction
The Pandas cut() function is a powerful tool for binning data, that is, converting a continuous variable into categorical bins. This tutorial walks through understanding and applying cut() with five practical examples, ranging from basic to advanced.
What Does “Binning” Mean?
Before diving into the examples, it’s essential to understand what binning means and why it might be beneficial. Binning, or bucketing, is the process of transforming continuous data into categories, for example turning exact ages into age groups. This technique helps in data analysis when you want to segment a population into a small number of groups that are easier to count, compare, and plot.
Pandas’ cut() function offers a straightforward way to perform this operation. You can pass explicit bin edges or an integer number of equal-width bins, attach labels to the bins, and choose which side of each interval is closed.
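As a minimal sketch of the equal-width option, the snippet below passes an integer to bins instead of a list of edges and asks for the computed edges back with retbins=True. The values array is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the examples in this tutorial)
values = np.array([1, 4, 7, 10])

# An integer `bins` requests that many equal-width bins over the data range.
# retbins=True also returns the edges pandas computed.
binned, edges = pd.cut(values, bins=3, retbins=True)

print(binned)
print(edges)
```

Pandas nudges the lowest edge slightly below the minimum so that the smallest value still lands in the first bin.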
Example 1: Basic Binning
Let’s start with a simple example of dividing the age of individuals into categories.
import pandas as pd
import numpy as np
# Sample data
ages = np.array([25, 22, 35, 45, 55, 63, 36, 41])
# Using cut to bin data
age_categories = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70], right=True)
print(age_categories)
This will output categories for each age, showing which bin they fall into, such as (20, 30] for ages 25 and 22:
[(20, 30], (20, 30], (30, 40], (40, 50], (50, 60], (60, 70], (30, 40], (40, 50]]
Categories (5, interval[int64, right]): [(20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70]]
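Values that sit exactly on a bin edge are where the right parameter matters. The sketch below, using a small made-up array of edge values, compares the default right-closed intervals with right=False and with include_lowest=True.

```python
import numpy as np
import pandas as pd

# Illustrative ages that sit exactly on bin edges
edge_ages = np.array([20, 30, 40])
bins = [20, 30, 40, 50]

# Default: intervals are closed on the right, e.g. (20, 30]
right_closed = pd.cut(edge_ages, bins=bins)

# right=False: intervals are closed on the left, e.g. [20, 30)
left_closed = pd.cut(edge_ages, bins=bins, right=False)

# include_lowest=True: the first interval also includes its left edge
with_lowest = pd.cut(edge_ages, bins=bins, include_lowest=True)

for name, result in [
    ("right_closed", right_closed),
    ("left_closed", left_closed),
    ("with_lowest", with_lowest),
]:
    # A code of -1 marks a value that fell into no bin (shown as NaN)
    print(name, result.codes.tolist())
```

With the default settings, a value equal to the lowest edge (20 here) falls into no bin at all, which is easy to miss on real data.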
Example 2: Labeling Bins
It’s often helpful to label the bins for easier interpretation. Here’s how to add labels:
import pandas as pd
import numpy as np
# Sample data
ages = np.array([25, 22, 35, 45, 55, 63, 36, 41])
# Labeling the bins
age_categories = pd.cut(
    ages, bins=[20, 30, 40, 50, 60, 70], labels=["20s", "30s", "40s", "50s", "60s"]
)
print(age_categories)
Now the output assigns a meaningful label to each category, such as ‘20s’ for the first age bin:
['20s', '20s', '30s', '40s', '50s', '60s', '30s', '40s']
Categories (5, object): ['20s' < '30s' < '40s' < '50s' < '60s']
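Labeled bins are most useful once they live in a DataFrame column. The following sketch, which assumes a DataFrame with a single hypothetical age column, counts how many people fall into each group and also shows labels=False, which returns integer bin positions instead of names.

```python
import pandas as pd

# Hypothetical DataFrame holding the same ages as above
df = pd.DataFrame({"age": [25, 22, 35, 45, 55, 63, 36, 41]})
bins = [20, 30, 40, 50, 60, 70]
labels = ["20s", "30s", "40s", "50s", "60s"]

# Store the labeled bins as a new categorical column
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Count members per group, ordered by bin rather than by frequency
counts = df["age_group"].value_counts().sort_index()
print(counts)

# labels=False returns the integer position of each bin instead of a label
df["age_bin_index"] = pd.cut(df["age"], bins=bins, labels=False)
print(df)
```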
Example 3: Dynamic Bin Ranges with qcut
Moving beyond static bin ranges, Pandas provides the qcut() function, which is similar to cut() but determines bin edges based on quantiles. As a result, each bin receives roughly the same number of data points, even when the data is skewed.
import pandas as pd
import numpy as np
# Sample data: 100 random incomes between 0 and 100,000
income = np.random.rand(100) * 100000
# Using qcut to create quartile-based bins
income_categories = pd.qcut(income, q=4)
print(income_categories)
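This dynamically creates four bins with approximately equal numbers of individuals in each. Because the incomes are generated randomly without a fixed seed, the exact bin edges differ from run to run.
Output (from one run):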
[(56678.937, 80537.709], (24509.935, 56678.937], (80537.709, 98962.078], (24509.935, 56678.937], (80537.709, 98962.078], ..., (1613.813, 24509.935], (24509.935, 56678.937], (24509.935, 56678.937], (56678.937, 80537.709], (80537.709, 98962.078]]
Length: 100
Categories (4, interval[float64, right]): [(1613.813, 24509.935] < (24509.935, 56678.937] <
(56678.937, 80537.709] < (80537.709, 98962.078]]
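To make a quantile example reproducible and to check that the bins really are balanced, one option is to seed NumPy's random generator and count the members of each bin. The seed value and the Q1 to Q4 labels below are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the random incomes reproducible (seed is arbitrary)
rng = np.random.default_rng(seed=0)
income = rng.random(100) * 100_000

# Quartile-based bins with illustrative labels
quartiles = pd.qcut(income, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Count how many incomes landed in each quartile
counts = pd.Series(quartiles).value_counts().sort_index()
print(counts)
```

With 100 distinct values and four quantiles, each bin should hold 25 values. Heavily tied data can prevent an even split, since equal values always share a bin.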
Example 4: Handling Outliers with cut
Sometimes, our dataset might have outliers that we want to handle explicitly or exclude from bins. Here’s one way to do it:
import pandas as pd
import numpy as np
# Sample data
temperature = [22, 25, 15, 30, 33, 40, 45, 50, 60, 10, 5]
# Open-ended first and last bins catch outliers at both extremes
temperature_categories = pd.cut(
    temperature,
    bins=[-np.inf, 15, 25, 35, np.inf],
    labels=["Very Cold", "Cold", "Warm", "Very Warm"],
)
print(temperature_categories)
This assigns ‘Very Cold’ to temperatures of 15 or below and ‘Very Warm’ to temperatures above 35, so extreme values still receive a category instead of being left unbinned.
Output:
['Cold', 'Cold', 'Very Cold', 'Warm', 'Warm', ..., 'Very Warm', 'Very Warm', 'Very Warm', 'Very Cold', 'Very Cold']
Length: 11
Categories (4, object): ['Very Cold' < 'Cold' < 'Warm' < 'Very Warm']
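The other option mentioned above, excluding outliers, follows from how cut() treats values outside the outermost edges: they become NaN. The sketch below reuses the same temperatures with closed-ended bins and then filters on those missing values. The two-label scheme is an assumption made for brevity.

```python
import pandas as pd

temps = pd.Series([22, 25, 15, 30, 33, 40, 45, 50, 60, 10, 5], name="temperature")

# Without infinite edges, anything outside (15, 35] is assigned NaN
bounded = pd.cut(temps, bins=[15, 25, 35], labels=["Cold", "Warm"])

# Count the values that fell outside every bin
n_outliers = bounded.isna().sum()

# Keep only the temperatures that received a category
kept = temps[bounded.notna()]

print(n_outliers)
print(kept.tolist())
```

Note that 15 itself is excluded here, because the default intervals are open on the left.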
Example 5: Custom Functions and cut
For more complex binning scenarios, you might want to use custom functions alongside cut(). This example wraps cut() in a function that serves as a placeholder for your own binning logic:
import pandas as pd
import numpy as np
# Custom binning function
def custom_binning(data_series, bins, labels=None):
    # Your custom logic here
    return pd.cut(data_series, bins, labels=labels)

# Sample data
scores = [85, 92, 88, 70, 65, 80, 95]
# Applying the custom function
custom_bins = custom_binning(
    scores, bins=[50, 70, 85, 100], labels=["Average", "Good", "Excellent"]
)
print(custom_bins)
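As written, the wrapper simply forwards its arguments to cut(). The benefit is that any validation, preprocessing, or post-processing you need can live in one place, tailored to your specific data analysis needs.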
Output:
['Good', 'Excellent', 'Excellent', 'Average', 'Average', 'Good', 'Excellent']
Categories (3, object): ['Average' < 'Good' < 'Excellent']
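As one possible way to fill in the placeholder, the sketch below defines a hypothetical helper, bin_with_fallback, that validates the label count and assigns an explicit fallback category to values outside the bin edges. The function name, the fallback label, and the extra score of 40 are all assumptions made for illustration.

```python
import pandas as pd


def bin_with_fallback(values, bins, labels, fallback="Out of range"):
    """Hypothetical helper: bin values and label anything outside the edges."""
    # cut() needs exactly one label per interval between consecutive edges
    if len(labels) != len(bins) - 1:
        raise ValueError("need exactly one label per bin")

    binned = pd.Series(pd.cut(values, bins=bins, labels=labels))

    # Values outside the edges come back as NaN; register the fallback
    # as a category first, then use it to fill those gaps
    return binned.cat.add_categories([fallback]).fillna(fallback)


# Same scores as above, plus one value (40) below the lowest edge
scores = [85, 92, 88, 70, 65, 80, 95, 40]
graded = bin_with_fallback(
    scores, bins=[50, 70, 85, 100], labels=["Average", "Good", "Excellent"]
)
print(graded.tolist())
```

Adding the fallback category before filling is necessary because a categorical column only accepts values that are already among its categories.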
Conclusion
The Pandas cut() function is a versatile tool for segmenting and analyzing continuous data. Through the examples provided, we’ve seen how to apply it in various scenarios, from basic binning to integrating custom logic. Mastering this function can significantly enhance your data processing and analysis capabilities.