Pandas DataFrame: How to describe summary stats of each group

Updated: February 21, 2024 By: Guest Contributor

Introduction

Grouped summary statistics are often the first step in understanding a dataset: they show how a measure differs from one category to the next. With Python’s Pandas library, you can split a DataFrame into groups and summarize each group in a line or two of code. This tutorial walks through grouping data and computing summary statistics for each group, from the built-in describe() method to custom aggregation functions.

Before we dive in, make sure you have Pandas installed. You can do that using pip:

pip install pandas

Getting Started

Let’s begin by importing Pandas and creating a simple DataFrame to work with:

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B'],
    'Values': [10, 20, 15, 30, 40, 20, 45, 10],
    'Ids': [1, 2, 1, 2, 3, 3, 2, 1]
})

Basic Grouping and Summary Statistics

First up, let’s group our DataFrame by the ‘Category’ column and calculate some basic summary statistics:

grouped = df.groupby('Category')
summary = grouped.describe()
print(summary)

This code groups the data by ‘Category’ and applies describe(), which reports the count, mean, std (sample standard deviation), min, the 25th, 50th, and 75th percentiles, and max of each column within each group. The output will look something like this:

# Output
         Values                                              Ids
          count  mean        std   min    25%   50%    75%   max count      mean       std  min   25%  50%   75%  max
Category
A           3.0  15.0   5.000000  10.0  12.50  15.0  17.50  20.0   3.0  1.666667  1.154701  1.0  1.00  1.0  2.00  3.0
B           3.0  20.0  10.000000  10.0  15.00  20.0  25.00  30.0   3.0  1.666667  0.577350  1.0  1.50  2.0  2.00  2.0
C           2.0  42.5   3.535534  40.0  41.25  42.5  43.75  45.0   2.0  2.500000  0.707107  2.0  2.25  2.5  2.75  3.0
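The describe() result labels its columns with two levels: the original column name and the statistic. The following is a minimal sketch of how individual pieces can be pulled out of it. It rebuilds the sample DataFrame so that it runs on its own, and the names values_stats, mean_a, and values_deciles are illustrative only.

```python
import pandas as pd

# Same sample data as in the tutorial, rebuilt so this block is self-contained
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B'],
    'Values': [10, 20, 15, 30, 40, 20, 45, 10],
    'Ids': [1, 2, 1, 2, 3, 3, 2, 1]
})

summary = df.groupby('Category').describe()

# All statistics for one column: select the top-level column label
values_stats = summary['Values']

# One statistic for one group: row label plus a (column, statistic) tuple
mean_a = summary.loc['A', ('Values', 'mean')]

# Describe a single column and ask for the 10th and 90th percentiles
values_deciles = df.groupby('Category')['Values'].describe(percentiles=[0.1, 0.9])

print(values_stats)
print(mean_a)
print(values_deciles)
```

Selecting the column before calling describe() keeps the result narrow, which helps when the DataFrame has many numeric columns.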

Advanced Group By Operations

For more detailed analysis, you can perform operations specific to columns within groups. Let’s calculate the sum and average of ‘Values’ for each ‘Category,’ while counting the unique ‘Ids’:

advanced_grouping = df.groupby('Category').agg({
    'Values': ['sum', 'mean'],
    'Ids': pd.Series.nunique
})
print(advanced_grouping)

The output showcases the flexibility of the groupby operation, allowing for tailored statistical computations:

# Output
         Values           Ids
            sum  mean nunique
Category
A            45  15.0       2
B            60  20.0       2
C            85  42.5       2
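The dictionary form of agg() produces two-level column labels, which can be awkward to reference later. As an alternative sketch, Pandas also supports named aggregation, where each keyword argument maps an output column name to a (column, function) pair. The output names total_value, mean_value, and unique_ids below are chosen only for illustration.

```python
import pandas as pd

# Same sample data as in the tutorial, rebuilt so this block is self-contained
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B'],
    'Values': [10, 20, 15, 30, 40, 20, 45, 10],
    'Ids': [1, 2, 1, 2, 3, 3, 2, 1]
})

# Named aggregation: output_name=(input_column, function)
flat_summary = df.groupby('Category').agg(
    total_value=('Values', 'sum'),
    mean_value=('Values', 'mean'),
    unique_ids=('Ids', 'nunique'),
)

print(flat_summary)
```

The result has a single level of column names, so a value can be read with flat_summary.loc['A', 'total_value'] instead of a tuple key.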

Visualizing Summary Statistics

Visual representation can make data insights more immediate and impactful. Using Pandas’ integration with Matplotlib, you can easily visualize these statistics. To plot the mean ‘Values’ for each category:

import matplotlib.pyplot as plt

grouped['Values'].mean().plot(kind='bar')
plt.show()

The resulting bar chart provides a clear visual comparison of the mean values across the groups.

Custom Aggregation Functions

For even more customized analysis, you can define your own aggregation functions. For instance, to create a function that calculates the range (max – min) of ‘Values’ in each group:

def calc_range(series):
    return series.max() - series.min()

# Use a new name so the earlier `grouped` GroupBy object is not overwritten
value_ranges = df.groupby('Category').agg({
    'Values': calc_range
})
print(value_ranges)

agg() calls calc_range once per group, passing that group’s ‘Values’ as a Series, and collects the returned numbers into one row per category:

# Output
         Values
Category       
A             10
B             20
C              5
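A custom function can also sit alongside built-in aggregations in a single call. The sketch below passes a list to agg() on the ‘Values’ column; it assumes the same sample data and the same calc_range function, and range_summary is an illustrative name.

```python
import pandas as pd

# Same sample data as in the tutorial, rebuilt so this block is self-contained
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B'],
    'Values': [10, 20, 15, 30, 40, 20, 45, 10],
    'Ids': [1, 2, 1, 2, 3, 3, 2, 1]
})

def calc_range(series):
    """Return the spread between the largest and smallest value."""
    return series.max() - series.min()

# Built-in names are given as strings, the custom function as a callable;
# the custom column is labelled with the function's name
range_summary = df.groupby('Category')['Values'].agg(['min', 'max', calc_range])

print(range_summary)
```

Keeping min and max next to the range makes it easy to check the custom result by eye.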

Conclusion

The techniques in this tutorial cover most per-group summaries you are likely to need: describe() for a quick overview, agg() for statistics chosen column by column, a bar chart for visual comparison, and custom functions for anything the built-ins do not provide. Together they make it straightforward to compare groups within a DataFrame.