Sling Academy
Home/Pandas/Pandas DataFrame: Calculating sum/average of rows in each group

Pandas DataFrame: Calculating sum/average of rows in each group

Last updated: February 21, 2024

Overview

In data analysis, one often needs to aggregate data to understand patterns or compare subsets. Pandas, a Python library for data manipulation and analysis, offers powerful tools for grouping and summarizing data. In this tutorial, we’ll explore how to calculate the sum or average of rows in each group in a Pandas DataFrame.

Getting Started

Before diving into the computations, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas

And then, import Pandas in your script:

import pandas as pd

Basic Grouping and Aggregation

First, let’s create a simple DataFrame to work with:

import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Values': [10, 20, 15, 10, 25, 30, 5]
}
df = pd.DataFrame(data)
print(df)

This DataFrame stores values under two categories: Category and Values. To calculate the sum of values for each category:

grouped = df.groupby('Category')
sum_df = grouped.sum()
print(sum_df)

Output:

Category A: 55 B: 45 C: 15

Similarly, to find the average:

average_df = grouped.mean() print(average_df)

Output:

Category A: 18.33 B: 22.5 C: 7.5

Using the agg() Method

The agg() method allows for more flexibility by passing a dictionary that specifies the operations to apply to each column. Let’s calculate both the sum and average at once:

results = df.groupby('Category').agg({'Values': ['sum', 'mean']}) print(results)

Output:

Category
A: sum 55, mean 18.33
B: sum 45, mean 22.5
C: sum 15, mean 7.5

Grouping by Multiple Columns

More complex datasets might require grouping by multiple columns. Let’s enhance our DataFrame:

import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Subcategory': ['X', 'X', 'Y', 'Y', 'X', 'Y', 'X'],
    'Values': [10, 20, 15, 10, 25, 30, 5]
}
df = pd.DataFrame(data)
grouped = df.groupby(['Category', 'Subcategory'])
sum_sub = grouped.sum()
print(sum_sub)

Output:

                      Values
Category Subcategory        
A        X               10
         Y               45
B        X               45
C        X                5
         Y               10

For average values:

avg_sub = grouped.mean() print(avg_sub)

Advanced Aggregation with Custom Functions

Sometimes, the aggregation you need isn’t as straightforward as sum or average. In such cases, you can apply custom functions using agg(). For instance, to calculate the range of values within each group:

range_func = lambda x: x.max() - x.min()
range_df = df.groupby('Category').agg({'Values': range_func})
print(range_df)

Output:

Category
A          20
B           5
C           5

Applying Multiple Functions to Multiple Columns

You can also apply different functions to different columns. Let’s provide an example:

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Values': [10, 20, 15, 10, 25, 30, 5],
    'Weights': [1, 2, 1.5, 2, 3, 2.5, 1]
}
df = pd.DataFrame(data)
func_dict = {'Values': ['sum', 'mean'], 'Weights': ['max', 'min']}
results = df.groupby('Category').agg(func_dict)
print(results)

This outputs the sum and average of the Values column and the maximum and minimum of the Weights column for each category:

         Values            Weights
           sum  mean      max  min
Category                           
A            55  18.33     2.5  1.0
B            45  22.50     3.0  2.0
C            15   7.50     2.0  1.0

Conclusion

Grouping and aggregating in Pandas is a potent tool for summarizing and analyzing data. By understanding how to employ these techniques, you can unlock deeper insights into your datasets. Whether you need to calculate the sum, average, or apply more complex aggregations, Pandas offers the flexibility to handle it efficiently.

Next Article: Pandas DataFrame: Finding min/max value in each group

Previous Article: Pandas DataFrame: Counting rows in each group (4 examples)

Series: DateFrames in Pandas

Pandas

You May Also Like

  • How to Use Pandas Profiling for Data Analysis (4 examples)
  • How to Handle Large Datasets with Pandas and Dask (4 examples)
  • Pandas – Using DataFrame.pivot() method (3 examples)
  • Pandas: How to ‘FULL JOIN’ 2 DataFrames (3 examples)
  • Pandas: Select columns whose names start/end with a specific string (4 examples)
  • 3 ways to turn off future warnings in Pandas
  • How to Integrate Pandas with Apache Spark
  • How to Use Pandas for Web Scraping and Saving Data (2 examples)
  • How to Clean and Preprocess Text Data with Pandas (3 examples)
  • Pandas – Using Series.replace() method (3 examples)
  • Pandas json_normalize() function: Explained with examples
  • Pandas: Reading CSV and Excel files from AWS S3 (4 examples)
  • Using pandas.Series.rank() method (4 examples)
  • Pandas: Dropping columns whose names contain a specific string (4 examples)
  • Pandas: How to print a DataFrame without index (3 ways)
  • Fixing Pandas NameError: name ‘df’ is not defined
  • Pandas – Using DataFrame idxmax() and idxmin() methods (4 examples)
  • Pandas FutureWarning: ‘M’ is deprecated and will be removed in a future version, please use ‘ME’ instead
  • Pandas: Checking equality of 2 DataFrames (element-wise)