Pandas: Find the cumulative min/max of each group in a DataFrame

Updated: February 21, 2024 By: Guest Contributor Post a comment

Introduction

When analyzing large datasets, especially in finance, environmental sciences, or social sciences, understanding the cumulative minimum or maximum value of a dataset grouped by categories plays a crucial role in identifying trends, outliers, or anomalies. The Python library Pandas, with its powerful and expressive data manipulation capabilities, makes these analyses straightforward and efficient.

This tutorial will guide you through finding the cumulative minimum and maximum of each group in a Pandas DataFrame with clear examples escalating from basic to advanced use cases. By the end, you will be proficient in applying these techniques to real-world data analysis tasks.

What are Cumulative Min and Max?

Before diving into the cumulative minimum or maximum, let’s first understand what we mean by these terms in the context of a DataFrame. The cumulative minimum (cummin) of a series is a sequence where each element is the minimum of all previous elements including itself up to that point. Similarly, the cumulative maximum (cummax) refers to a sequence where each element is the maximum of all elements up to that point, including itself.

Starting with a Simple Example

Lets start with creating a simple DataFrame:

import pandas as pd
df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [1, 3, 2, 5, 4, 6]
})
print(df)

Output:

  Group  Value
0     A      1
1     A      3
2     B      2
3     B      5
4     C      4
5     C      6

To find the cumulative minimum and maximum within each group, we first need to understand GroupBy objects in Pandas. Grouping in Pandas can be achieved with the groupby() method, after which aggregation functions can be applied to each group separately.

Finding Cumulative Min and Max

Here is how to compute the cumulative minimum:

df['CumMin'] = df.groupby('Group')['Value'].cummin()
print(df)

Output:

  Group  Value  CumMin
0     A      1      1
1     A      3      1
2     B      2      2
3     B      5      2
4     C      4      4
5     C      6      4

Similarly, for cumulative maximum:

df['CumMax'] = df.groupby('Group')['Value'].cummax()
print(df)

Output:

  Group  Value  CumMax
0     A      1      1
1     A      3      3
2     B      2      2
3     B      5      5
4     C      4      4
5     C      6      6

Advanced Use Cases

Now that we have grasped the basics, let’s proceed to more advanced scenarios. Consider a DataFrame with multiple columns where you need to compute the cumulative minimum or maximum for one group based on the values of another.

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'SubGroup': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Value': [1, 3, 2, 5, 4, 6]
})

df['CumMin'] = df.groupby(['Group', 'SubGroup'])['Value'].cummin()
df['CumMax'] = df.groupby(['Group', 'SubGroup'])['Value'].cummax()
print(df)

Output:

  Group SubGroup  Value  CumMin  CumMax
0     A        X      1       1       1
1     A        Y      3       3       3
2     B        X      2       2       2
3     B        Y      5       5       5
4     C        X      4       4       4
5     C        Y      6       6       6

This example showcased the ability to apply cumulative functions on a more granular level by grouping on multiple columns.

Handling Missing Values

When working with real-world datasets, missing values are commonplace. It’s crucial to understand how cummin and cummax functions handle NaN (Not a Number) values. By default, these functions ignore NaN values, continuing the cumulative calculation with the next valid number.

Conclusion

Finding the cumulative minimum and maximum of each group in a DataFrame is a powerful technique for analyzing datasets in various fields. This tutorial walked you through basic to advanced examples, demonstrating the flexibility and efficiency of using Pandas for such tasks. As you’ve seen, with just a few lines of code, it’s possible to extract meaningful insights from your data, making Pandas an indispensable tool in your data analysis toolkit.