Pandas: Mastering DataFrame.groupby() method (8 examples)

Updated: February 20, 2024 By: Guest Contributor

Introduction

Pandas is a cornerstone library for data analysis and data science in Python. Among its many features, the groupby() method stands out for its ability to group data for aggregation, transformation, filtering, and more. In this tutorial, we will work through the groupby() method in 8 progressive examples. By the end, you will have a solid understanding of how to use it in your own data analysis tasks.

Purpose of groupby()

The groupby() method splits the data into groups based on one or more keys, such as the values in a column. Pandas then applies a function to each group independently and combines the per-group results into a new object. This pattern is known as the split-apply-combine strategy.
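
To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with a single groupby() call. The small DataFrame and the names pieces, sums, manual, and via_groupby are illustrative assumptions, not part of the examples that follow.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
})

# Split: one sub-frame per distinct key in column A
pieces = {key: df[df['A'] == key] for key in df['A'].unique()}

# Apply: compute the sum of B inside each piece
sums = {key: piece['B'].sum() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series, sorted by key
manual = pd.Series(sums, name='B').sort_index()
manual.index.name = 'A'

# groupby() performs the same three steps in a single call
via_groupby = df.groupby('A')['B'].sum()

print(manual)
print(via_groupby)
```

Both Series hold 6 for bar and 4 for foo. The manual version is only meant to show what happens conceptually; in practice groupby() is the shorter and faster choice.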

Example 1: Basic Grouping

First, we start with the most basic example of grouping by a single column.

import pandas as pd
import numpy as np

np.random.seed(2024)

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'C': np.random.randn(4),
})

print(df.groupby('A').sum())

This will sum the values in columns B and C for each group (‘foo’ and ‘bar’).

Output:

     B         C
A               
bar  6  0.586436
foo  4  1.466510
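
Often you only need some of the columns. The sketch below shows, under the assumption of a DataFrame with an extra text column (the hypothetical 'label'), how to select columns before aggregating and how numeric_only=True skips non-numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'label': ['w', 'x', 'y', 'z'],  # hypothetical non-numeric column
})

# Single brackets select one column and return a Series
b_series = df.groupby('A')['B'].sum()

# Double brackets keep the result as a DataFrame
b_frame = df.groupby('A')[['B']].sum()

# numeric_only=True leaves out non-numeric columns such as 'label'
numeric = df.groupby('A').sum(numeric_only=True)

print(b_series)
print(b_frame)
print(numeric)
```

Selecting the columns first also avoids doing work on columns you do not intend to use.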

Example 2: Grouping by Multiple Columns

Next, we’ll group by more than one column to see how the grouping keys get more specific.

import pandas as pd
import numpy as np

np.random.seed(2024)

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'C': np.random.randn(4),
})

df['D'] = ['one', 'one', 'two', 'three'] 
print(df.groupby(['A', 'D']).sum()) 

Output:

           B         C
A   D                 
bar one    2  0.737348
    three  4 -0.150912
foo one    1  1.668047
    two    3 -0.201538

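In this example, each distinct (A, D) pair that occurs in the data forms its own group: (‘bar’, ‘one’), (‘bar’, ‘three’), (‘foo’, ‘one’), and (‘foo’, ‘two’). The sums are computed within these more granular groups, and the result is indexed by both keys.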
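
If you would rather have the grouping keys as ordinary columns than as a two-level index, one option is as_index=False; another is to call reset_index() on the result. The following is a small sketch with fixed, illustrative values (column C is left out to keep it short).

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'D': ['one', 'one', 'two', 'three'],
})

# as_index=False returns the keys A and D as regular columns
flat = df.groupby(['A', 'D'], as_index=False).sum()
print(flat)

# Alternative: aggregate first, then move the index levels into columns
flat_again = df.groupby(['A', 'D']).sum().reset_index()
print(flat_again)
```

Both results have the columns A, D, and B with a plain integer index, which is usually easier to merge or export.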

Example 3: Aggregating Using Different Functions

Now, let’s see how to use different functions on the grouped data.

import pandas as pd
import numpy as np

np.random.seed(2024)

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'C': np.random.randn(4),
})

grouped = df.groupby('A') 
print(grouped.aggregate({'B': 'min', 'C': 'max'})) 

Output:

     B         C
A               
bar  2  0.737348
foo  1  1.668047

Here, we specify which function to apply to each column in our aggregated output.
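
A related form is named aggregation, where each output column gets an explicit name and a (column, function) pair. The sketch below uses fixed values for C and the hypothetical output names b_min and c_max so the result is easy to check by eye.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'C': [0.5, 1.5, 2.5, 3.5],  # fixed illustrative values instead of random ones
})

# keyword = (source column, aggregation function)
summary = df.groupby('A').agg(
    b_min=('B', 'min'),
    c_max=('C', 'max'),
)
print(summary)
```

This form is handy when the same column needs several statistics, because each result gets its own descriptive column name.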

Example 4: Custom Aggregation

It’s possible to define your own custom aggregation functions too.

import pandas as pd
import numpy as np

np.random.seed(2024)

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": [1, 2, 3, 4],
        "C": np.random.randn(4),
    }
)


def my_agg(series):
    return series.max() - series.min()


print(df.groupby("A").aggregate({"B": my_agg}))

Output:

     B
A     
bar  2
foo  2

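This code defines a custom function that calculates the range (max - min) of column B for each group. The function receives each group's values as a Series and must return a single value.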
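
Custom functions can also be mixed with built-in aggregations in one call. In the sketch below, value_range is a hypothetical helper name; when a function is passed in a list, pandas labels the resulting column with the function's name.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
})


def value_range(series):
    # Called once per group with that group's values of B
    return series.max() - series.min()


# Built-in names and a custom function side by side
stats = df.groupby('A')['B'].agg(['min', 'max', value_range])
print(stats)
```

Built-in names such as 'min' and 'max' use optimized code paths, while a Python function is called once per group, so prefer the built-ins when one fits.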

Example 5: Filtering After GroupBy

You can also filter groups after they are formed based on some criteria.

import pandas as pd
import numpy as np

np.random.seed(2024)

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": [1, 2, 3, 4],
        "C": np.random.randn(4),
    }
)


result = df.groupby('A').filter(lambda x: x['B'].sum() > 5)
print(result)

Output:

     A  B         C
1  bar  2  0.737348
3  bar  4 -0.150912

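This keeps only the groups whose sum of B is greater than 5. Here, bar (2 + 4 = 6) passes and foo (1 + 3 = 4) is dropped. Note that filter() returns the original rows of the surviving groups, not an aggregated result.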
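
The same selection can be written as a boolean mask built with transform(), which some readers find easier to combine with other conditions. This is a sketch of an alternative, not a replacement for filter(); the names filtered, mask, and masked are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
})

# filter(): the lambda sees each group's sub-frame and returns True or False
filtered = df.groupby('A').filter(lambda g: g['B'].sum() > 5)

# Mask: broadcast each group's sum back to its rows, then compare
mask = df.groupby('A')['B'].transform('sum') > 5
masked = df[mask]

print(filtered)
print(masked)
```

Both approaches keep rows 1 and 3, the two bar rows.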

Example 6: Applying Multiple Functions at Once

Note: This example uses the same DataFrame as the previous one.

Applying multiple functions to groups simultaneously is quite straightforward.

result = df.groupby('A').agg(['sum', 'mean'])
print(result)

Output:

      B              C          
    sum mean       sum      mean
A                               
bar   6  3.0  0.586436  0.293218
foo   4  2.0  1.466510  0.733255

This provides a sum and mean for both B and C columns across our groups in one go.
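
The result has two-level column labels, which can be awkward to work with. The sketch below, using fixed illustrative values for C, shows how to pick a single statistic with a tuple and one possible way to flatten the labels into plain strings.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'C': [0.5, 1.5, 2.5, 3.5],  # fixed illustrative values
})

result = df.groupby('A').agg(['sum', 'mean'])

# Select one statistic with a (column, function) tuple
b_mean = result[('B', 'mean')]
print(b_mean)

# Flatten the two-level labels into names like 'B_sum'
flat = result.copy()
flat.columns = ['_'.join(pair) for pair in flat.columns]
print(flat)
```

After flattening, the columns are B_sum, B_mean, C_sum, and C_mean.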

Example 7: Transformation of Group Data

Note: This example uses the same DataFrame as the previous one.

Transformation allows you to perform some computation on the groups and return an object that is the same size as the group chunk.

df['E'] = df.groupby('A')['B'].transform(lambda x: x - x.mean()) 
print(df) 

Output:

     A  B         C    E
0  foo  1  1.668047 -1.0
1  bar  2  0.737348 -1.0
2  foo  3 -0.201538  1.0
3  bar  4 -0.150912  1.0

This code centers column B within each group by subtracting the group's mean from every value.
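
transform() also accepts the name of a built-in aggregation, in which case the group-level result is repeated for every row of that group. The sketch below uses this to compute each row's share of its group total; the column names B_group_mean and B_share are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
})

# Each row receives the mean of its own group
df['B_group_mean'] = df.groupby('A')['B'].transform('mean')

# Each row's share of its group's total
df['B_share'] = df['B'] / df.groupby('A')['B'].transform('sum')

print(df)
```

Because the transformed result has the same index as the original DataFrame, it can be assigned directly as a new column.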

Example 8: Using GroupBy with Time Data

Lastly, let’s see the power of groupby() with time-series data.

import pandas as pd

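# 'MS' (month start) works across pandas versions; the month-end alias 'M'
# is deprecated since pandas 2.2 in favor of 'ME'
times = pd.date_range('2023-01-01', periods=4, freq='MS')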
df = pd.DataFrame({
    'date': times,
    'key': ['A', 'B', 'A', 'B'],
    'value': [10, 20, 30, 40],
})

# Group by the month of the date and sum only the 'value' column
monthly_sum = df.groupby(df['date'].dt.month)['value'].sum()
print(monthly_sum)

Output:

date
1    10
2    20
3    30
4    40
Name: value, dtype: int64

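This groups our data by month and sums the values within each month. Because dt.month returns only the month number (1 to 12), rows from the same calendar month in different years would land in the same group.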
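
When the data spans more than one year, grouping by a monthly period is one way to keep the years apart. The sketch below uses a small made-up dataset to contrast the two approaches; the names by_month and by_period are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-15', '2023-01-15', '2023-02-15']),
    'value': [10, 20, 30],
})

# Month number alone: January 2022 and January 2023 share the key 1
by_month = df.groupby(df['date'].dt.month)['value'].sum()
print(by_month)

# Monthly period: each year-month combination is its own group
by_period = df.groupby(df['date'].dt.to_period('M'))['value'].sum()
print(by_period)
```

The first result has two groups (the two Januaries add up to 30), while the second has three, one per year-month.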

Conclusion

The groupby() method is a flexible and powerful tool for data analysis. Through these examples, we’ve seen its capability to perform aggregation, transformation, and filtering, proving its indispensable role in Python’s data manipulation ecosystem. With practice, you’ll discover even more ways to use groupby() to streamline your data analysis workflows.