Introduction
Pandas is a cornerstone library for data analysis and data science in Python. Among its many features, the groupby() method stands out for its ability to group data for aggregation, transformation, filtering, and more. In this tutorial, we will work through the groupby() method with 8 progressive examples. By the end, you will have a solid understanding of how to use this tool in your own data analysis tasks.
Purpose of groupby()
The groupby() method splits the data into groups based on some criteria. Pandas then lets us apply a function to each group independently and combines the per-group results into a single object. This pattern is known as the split-apply-combine strategy.
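To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and compares the result with the one-line groupby() call. The column names team and points and the variable names are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "team": ["x", "y", "x", "y"],
    "points": [1, 2, 3, 4],
})

# Split: one sub-DataFrame per distinct value of "team".
pieces = {key: group for key, group in df.groupby("team")}

# Apply: compute a result for each piece independently.
totals = {key: group["points"].sum() for key, group in pieces.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(totals, name="points").rename_axis("team")

# The usual one-liner performs all three steps at once.
builtin = df.groupby("team")["points"].sum()

print(manual)
print(builtin)
```

Both Series should hold the same totals per team; in practice you would only ever write the one-liner.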
Example 1: Basic Grouping
First, we start with the most basic example of grouping by a single column.
import pandas as pd
import numpy as np
np.random.seed(2024)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'C': np.random.randn(4),
})
print(df.groupby('A').sum())
This sums the values in columns B and C for each group (‘foo’ and ‘bar’).
Output:
     B         C
A
bar  6  0.586436
foo  4  1.466510
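Two small variations on basic grouping are often useful: selecting a single column before aggregating, and keeping the group key as an ordinary column instead of the index. The sketch below assumes the same A and B columns as above; the names b_totals and flat are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1, 2, 3, 4],
})

# Select column B before aggregating so that only B is summed.
b_totals = df.groupby("A")["B"].sum()
print(b_totals)

# as_index=False keeps the group key A as a regular column.
flat = df.groupby("A", as_index=False)["B"].sum()
print(flat)
```

The second form is convenient when the result is going to be merged with other tables or written to a file.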
Example 2: Grouping by Multiple Columns
Next, we’ll group by more than one column to see how the grouping keys get more specific.
import pandas as pd
import numpy as np
np.random.seed(2024)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'C': np.random.randn(4),
})
df['D'] = ['one', 'one', 'two', 'three']
print(df.groupby(['A', 'D']).sum())
Output:
           B         C
A   D
bar one    2  0.737348
    three  4 -0.150912
foo one    1  1.668047
    two    3 -0.201538
In this example, the groups become (‘foo’, ‘one’), (‘foo’, ‘two’), etc. and we sum within these more granular groups.
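Grouping by several columns produces a result with a MultiIndex. As a rough sketch of how to work with it, the snippet below looks up one group by its key pair and then flattens the index back into columns. It reuses the A, B, and D columns from this example; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1, 2, 3, 4],
    "D": ["one", "one", "two", "three"],
})

summed = df.groupby(["A", "D"])["B"].sum()

# Look up a single group by its (A, D) key pair.
foo_two = summed.loc[("foo", "two")]
print(foo_two)

# reset_index() turns both index levels back into ordinary columns.
flat = summed.reset_index()
print(flat)
```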
Example 3: Aggregating Using Different Functions
Now, let’s see how to use different functions on the grouped data.
import pandas as pd
import numpy as np
np.random.seed(2024)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
    'C': np.random.randn(4),
})
grouped = df.groupby('A')
print(grouped.aggregate({'B': 'min', 'C': 'max'}))
Output:
     B         C
A
bar  2  0.737348
foo  1  1.668047
Here, we specify which function to apply to each column in our aggregated output.
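A related option is named aggregation, where each output column gets an explicit name via output_name=(input_column, function). The sketch below uses fixed values for column C instead of random ones so the result is reproducible; the output names b_min and c_max are hypothetical.

```python
import pandas as pd

# Fixed values stand in for the random column C (an assumption for reproducibility).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1, 2, 3, 4],
    "C": [0.5, 1.5, -0.5, 2.5],
})

# Named aggregation: output_name=(input_column, function)
summary = df.groupby("A").agg(
    b_min=("B", "min"),
    c_max=("C", "max"),
)
print(summary)
```

Naming the outputs this way avoids ambiguity when the same column is aggregated more than once.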
Example 4: Custom Aggregation
You can also define your own aggregation functions.
import pandas as pd
import numpy as np
np.random.seed(2024)
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": [1, 2, 3, 4],
        "C": np.random.randn(4),
    }
)

def my_agg(series):
    # Range of the group: largest value minus smallest value
    return series.max() - series.min()

print(df.groupby("A").aggregate({"B": my_agg}))
Output:
     B
A
bar  2
foo  2
This code defines a custom function that calculates the range (max-min) for each group.
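When a custom function can be expressed through built-in aggregations, combining those built-ins is usually a reasonable alternative, since pandas does not have to call a Python function once per group. The sketch below computes the same range both ways; the variable names custom and vectorized are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1, 2, 3, 4],
})

def my_agg(series):
    # Range of the group: largest value minus smallest value
    return series.max() - series.min()

# Custom function, called once per group.
custom = df.groupby("A")["B"].agg(my_agg)

# Same result from two built-in aggregations.
grouped_b = df.groupby("A")["B"]
vectorized = grouped_b.max() - grouped_b.min()

print(custom)
print(vectorized)
```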
Example 5: Filtering After GroupBy
You can also filter groups after they are formed based on some criteria.
import pandas as pd
import numpy as np
np.random.seed(2024)
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": [1, 2, 3, 4],
        "C": np.random.randn(4),
    }
)
result = df.groupby('A').filter(lambda x: x['B'].sum() > 5)
print(result)
Output:
     A  B         C
1  bar  2  0.737348
3  bar  4 -0.150912
This keeps only the groups whose sum of B is greater than 5. Here, ‘bar’ (sum 6) is kept and ‘foo’ (sum 4) is dropped.
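The same selection can be written as a boolean mask built with transform(), which some readers find easier to combine with other row-level conditions. This is a sketch of an equivalent approach, not a replacement for filter(); the names filtered, mask, and masked are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1, 2, 3, 4],
})

# filter() calls the function once per group and keeps or drops whole groups.
filtered = df.groupby("A").filter(lambda g: g["B"].sum() > 5)

# transform("sum") broadcasts each group's total back to its rows,
# so the comparison yields a row-level boolean mask.
mask = df.groupby("A")["B"].transform("sum") > 5
masked = df[mask]

print(filtered)
print(masked)
```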
Example 6: Applying Multiple Functions at Once
Note: This example uses the same DataFrame as the previous one.
To apply several functions to every column at once, pass a list of function names to agg().
result = df.groupby('A').agg(['sum', 'mean'])
print(result)
Output:
       B               C
     sum  mean       sum      mean
A
bar    6   3.0  0.586436  0.293218
foo    4   2.0  1.466510  0.733255
This provides a sum and mean for both B and C columns across our groups in one go.
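The result has two header levels (column name, then function name). The sketch below shows one way to select a single pair and one way to flatten the headers into plain names. It uses fixed values for C so the numbers are reproducible, and the "B_sum" naming scheme is just one possible convention.

```python
import pandas as pd

# Fixed values stand in for the random column C (an assumption for reproducibility).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1, 2, 3, 4],
    "C": [0.5, 1.5, -0.5, 2.5],
})

result = df.groupby("A").agg(["sum", "mean"])

# Select a single (column, function) pair with a tuple.
b_sum = result[("B", "sum")]
print(b_sum)

# Join the two header levels into flat names such as "B_sum".
flat = result.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat)
```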
Example 7: Transformation of Group Data
Note: This example uses the same DataFrame as the previous one.
Transformation allows you to perform some computation on the groups and return an object that is the same size as the group chunk.
df['E'] = df.groupby('A')['B'].transform(lambda x: x - x.mean())
print(df)
Output:
     A  B         C    E
0  foo  1  1.668047 -1.0
1  bar  2  0.737348 -1.0
2  foo  3 -0.201538  1.0
3  bar  4 -0.150912  1.0
This centers column B within each group by subtracting the group mean from every value.
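transform() also accepts the name of a built-in aggregation, in which case the group-level result is broadcast back to every row of the group. As a sketch, the snippet below reproduces the centering and adds each row's share of its group total; the new column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1, 2, 3, 4],
})

grouped_b = df.groupby("A")["B"]

# Broadcast the group mean to each row, then subtract it.
df["B_group_mean"] = grouped_b.transform("mean")
df["B_centered"] = df["B"] - df["B_group_mean"]

# Each row's share of its group's total.
df["B_share"] = df["B"] / grouped_b.transform("sum")

print(df)
```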
Example 8: Using GroupBy with Time Data
Lastly, let’s see the power of groupby() with time-series data.
import pandas as pd
# 'MS' (month start) is used here because the month-end alias 'M'
# is deprecated in recent pandas releases in favor of 'ME'
times = pd.date_range('2023-01-01', periods=4, freq='MS')
df = pd.DataFrame({
    'date': times,
    'key': ['A', 'B', 'A', 'B'],
    'value': [10, 20, 30, 40],
})
# Group by the month of the date and sum only the 'value' column
monthly_sum = df.groupby(df['date'].dt.month)['value'].sum()
print(monthly_sum)
Output:
date
1    10
2    20
3    30
4    40
Name: value, dtype: int64
This groups our data by month and sums the values within each month.
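One caveat: dt.month returns only the month number, so January 2023 and January 2024 would land in the same group. If the data spans several years, grouping by a monthly period keeps them apart. The sketch below uses hypothetical dates chosen to show the difference.

```python
import pandas as pd

# Hypothetical dates spanning two years, chosen to show the difference.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-15", "2023-01-20", "2024-01-10", "2024-02-05"]
    ),
    "value": [10, 20, 30, 40],
})

# Month number only: both Januaries are merged into group 1.
by_month_number = df.groupby(df["date"].dt.month)["value"].sum()
print(by_month_number)

# Monthly period (year and month): each calendar month is its own group.
by_period = df.groupby(df["date"].dt.to_period("M"))["value"].sum()
print(by_period)
```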
Conclusion
The groupby() method is a flexible and powerful tool for data analysis. Through these examples, we’ve seen how it handles aggregation, transformation, and filtering. With practice, you’ll discover even more ways to use groupby() to streamline your data analysis workflows.