Pandas DataFrame: Calculating sum/average of rows in each group

Updated: February 21, 2024 By: Guest Contributor Post a comment

Overview

In data analysis, one often needs to aggregate data to understand patterns or compare subsets. Pandas, a Python library for data manipulation and analysis, offers powerful tools for grouping and summarizing data. In this tutorial, we’ll explore how to calculate the sum or average of rows in each group in a Pandas DataFrame.

Getting Started

Before diving into the computations, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas

And then, import Pandas in your script:

import pandas as pd

Basic Grouping and Aggregation

First, let’s create a simple DataFrame to work with:

import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Values': [10, 20, 15, 10, 25, 30, 5]
}
df = pd.DataFrame(data)
print(df)

This DataFrame stores values under two categories: Category and Values. To calculate the sum of values for each category:

grouped = df.groupby('Category')
sum_df = grouped.sum()
print(sum_df)

Output:

Category A: 55 B: 45 C: 15

Similarly, to find the average:

average_df = grouped.mean() print(average_df)

Output:

Category A: 18.33 B: 22.5 C: 7.5

Using the agg() Method

The agg() method allows for more flexibility by passing a dictionary that specifies the operations to apply to each column. Let’s calculate both the sum and average at once:

results = df.groupby('Category').agg({'Values': ['sum', 'mean']}) print(results)

Output:

Category
A: sum 55, mean 18.33
B: sum 45, mean 22.5
C: sum 15, mean 7.5

Grouping by Multiple Columns

More complex datasets might require grouping by multiple columns. Let’s enhance our DataFrame:

import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Subcategory': ['X', 'X', 'Y', 'Y', 'X', 'Y', 'X'],
    'Values': [10, 20, 15, 10, 25, 30, 5]
}
df = pd.DataFrame(data)
grouped = df.groupby(['Category', 'Subcategory'])
sum_sub = grouped.sum()
print(sum_sub)

Output:

                      Values
Category Subcategory        
A        X               10
         Y               45
B        X               45
C        X                5
         Y               10

For average values:

avg_sub = grouped.mean() print(avg_sub)

Advanced Aggregation with Custom Functions

Sometimes, the aggregation you need isn’t as straightforward as sum or average. In such cases, you can apply custom functions using agg(). For instance, to calculate the range of values within each group:

range_func = lambda x: x.max() - x.min()
range_df = df.groupby('Category').agg({'Values': range_func})
print(range_df)

Output:

Category
A          20
B           5
C           5

Applying Multiple Functions to Multiple Columns

You can also apply different functions to different columns. Let’s provide an example:

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Values': [10, 20, 15, 10, 25, 30, 5],
    'Weights': [1, 2, 1.5, 2, 3, 2.5, 1]
}
df = pd.DataFrame(data)
func_dict = {'Values': ['sum', 'mean'], 'Weights': ['max', 'min']}
results = df.groupby('Category').agg(func_dict)
print(results)

This outputs the sum and average of the Values column and the maximum and minimum of the Weights column for each category:

         Values            Weights
           sum  mean      max  min
Category                           
A            55  18.33     2.5  1.0
B            45  22.50     3.0  2.0
C            15   7.50     2.0  1.0

Conclusion

Grouping and aggregating in Pandas is a potent tool for summarizing and analyzing data. By understanding how to employ these techniques, you can unlock deeper insights into your datasets. Whether you need to calculate the sum, average, or apply more complex aggregations, Pandas offers the flexibility to handle it efficiently.