Pandas time series: Find the sum/avg/min/max of each day/month/year

Updated: February 22, 2024 By: Guest Contributor

Introduction

Pandas is a powerhouse tool for data analysis in Python, providing high-performance, easy-to-use data structures. Among its versatile features, time series analysis stands out, allowing users to effortlessly manipulate date and time-based data. In this comprehensive tutorial, we’ll explore how to find the sum, average, minimum, and maximum of values for each day, month, and year within a Pandas DataFrame.

Getting Started

Before diving into time series operations, ensure you have Pandas installed in your environment:

pip install pandas

Pandas depends on python-dateutil and installs it automatically, so the following command is only needed if the package is missing from your environment:

pip install python-dateutil

Let's begin by creating some sample time series data:

import pandas as pd
import numpy as np

# Create a date range
date_rng = pd.date_range(start='1/1/2022', end='12/31/2022', freq='D')
# Create a sample DataFrame
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.rand(len(date_rng))

Setting the DatetimeIndex

The resample() method used in the rest of this tutorial works on a datetime-like index, so we set the DataFrame's index to a DatetimeIndex:

df.set_index('date', inplace=True)

Now, our DataFrame is ready for time-based grouping operations.
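Real data often arrives with dates stored as strings rather than as a ready-made date range. The sketch below shows one way to handle that case, assuming a small hypothetical frame named raw with ISO-formatted date strings; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: dates arrive as plain strings, as they might from a CSV file
raw = pd.DataFrame({
    "date": ["2022-01-03", "2022-01-01", "2022-01-02"],
    "data": [2.5, 0.5, 1.5],
})

# Convert the strings to datetime64 values, then use them as a sorted index
raw["date"] = pd.to_datetime(raw["date"])
indexed = raw.set_index("date").sort_index()

# Confirm that the index is a DatetimeIndex before resampling
has_datetime_index = isinstance(indexed.index, pd.DatetimeIndex)
print(has_datetime_index)
print(indexed.index.dtype)
```

Sorting the index is not strictly required for resampling, but it keeps slicing by date predictable.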

Sum of Values by Time Period

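To calculate the sum of values for each day, month, or year, we use the resample() method. Because the sample data has exactly one value per day, the daily results simply equal the original values; daily resampling becomes useful when a day holds several observations: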

# Daily sum
df.resample('D').sum()
# Monthly sum
df.resample('M').sum()
# Yearly sum
df.resample('Y').sum()

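The 'D', 'M', and 'Y' aliases represent daily, month-end, and year-end frequencies, respectively. From pandas 2.2 onward, 'M' and 'Y' are deprecated in favor of 'ME' and 'YE', so use the newer aliases if you see a FutureWarning.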
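Because the string aliases for month-end and year-end frequencies differ between pandas versions, one option is to pass offset objects to resample() instead of strings. The following is a minimal sketch of that approach, not the only way to do it; the seeded generator and the variable names are assumptions made so the example is reproducible.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data with a fixed seed so results are reproducible
rng = np.random.default_rng(seed=0)
dates = pd.date_range(start="2022-01-01", end="2022-12-31", freq="D")
df = pd.DataFrame({"data": rng.random(len(dates))}, index=dates)

# Offset objects avoid depending on a particular string alias
monthly_sum = df.resample(pd.offsets.MonthEnd()).sum()
yearly_sum = df.resample(pd.offsets.YearEnd()).sum()

print(monthly_sum.head())
print(yearly_sum)
```

Each monthly row is labeled with the last calendar day of its month, and the single yearly row covers all of 2022.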

Average Value by Time Period

Finding the average (mean) follows a similar pattern, utilizing the .mean() method after resampling:

# Daily average
df.resample('D').mean()
# Monthly average
df.resample('M').mean()
# Yearly average
df.resample('Y').mean()
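An alternative to resample() is to group by calendar period with groupby(). The sketch below, which assumes the same kind of daily sample data (rebuilt here with a fixed seed), groups by the index converted to monthly periods, so each result is labeled with a period such as 2022-01 rather than a month-end timestamp.

```python
import numpy as np
import pandas as pd

# Daily sample data for 2022 with a fixed seed
rng = np.random.default_rng(seed=1)
dates = pd.date_range(start="2022-01-01", end="2022-12-31", freq="D")
df = pd.DataFrame({"data": rng.random(len(dates))}, index=dates)

# Group by monthly period and take the mean of each group
monthly_mean_by_period = df.groupby(df.index.to_period("M"))["data"].mean()

# The same averages via resample, labeled by month-end timestamps instead
monthly_mean_resampled = df["data"].resample(pd.offsets.MonthEnd()).mean()

print(monthly_mean_by_period.head())
```

Both approaches give the same numbers here; the choice mostly comes down to which index labels are more convenient downstream.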

Minimum and Maximum Values by Time Period

To discover the minimum and maximum values within each period, use the .min() and .max() methods:

# Daily minimum
df.resample('D').min()
# Monthly minimum
df.resample('M').min()
# Yearly minimum
df.resample('Y').min()

# Daily maximum
df.resample('D').max()
# Monthly maximum
df.resample('M').max()
# Yearly maximum
df.resample('Y').max()
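The sum, mean, minimum, and maximum can also be computed in a single pass by handing a list of function names to agg(). This is a minimal sketch under the same assumptions as the sample data above (daily values for 2022, rebuilt here with a fixed seed); the name monthly_stats is only illustrative.

```python
import numpy as np
import pandas as pd

# Daily sample data for 2022 with a fixed seed
rng = np.random.default_rng(seed=2)
dates = pd.date_range(start="2022-01-01", end="2022-12-31", freq="D")
df = pd.DataFrame({"data": rng.random(len(dates))}, index=dates)

# One row per month, one column per statistic
monthly_stats = (
    df["data"]
    .resample(pd.offsets.MonthEnd())
    .agg(["sum", "mean", "min", "max"])
)

print(monthly_stats)
```

Selecting the data column first returns a flat set of columns; calling agg() on the whole DataFrame would produce a two-level column index instead.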

Visualizing Time Series Data

Visualizing your time series data can reveal trends and outliers that summary tables hide. Pandas plots through Matplotlib, which must be installed separately (pip install matplotlib). To plot the monthly averages:

import matplotlib.pyplot as plt
df.resample('M').mean().plot()
plt.title('Monthly Average Data')
plt.xlabel('Month')
plt.ylabel('Avg Data')
plt.show()

Advanced Time Series Analysis

Beyond the basics, you might be interested in calculating rolling averages, performing seasonal decompositions, or building predictive models from time series data. Pandas handles rolling statistics on its own, while the statsmodels library adds decomposition and forecasting tools:

pip install statsmodels

Here’s how to calculate a simple 7-day rolling average, which needs only Pandas:

df['7-day rolling avg'] = df['data'].rolling(window=7).mean()

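The first six rows of the new column are NaN because a full 7-day window is not yet available for them. You can then plot the rolling average alongside the original data: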

df[['data', '7-day rolling avg']].plot()
plt.title('7-Day Rolling Average')
plt.xlabel('Date')
plt.ylabel('Data')
plt.show()
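By default, rolling() returns NaN until a full window of values is available. If partial windows at the start of the series are acceptable, the min_periods parameter relaxes that requirement. The sketch below compares the two behaviors on seeded sample data; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Daily sample data for 2022 with a fixed seed
rng = np.random.default_rng(seed=3)
dates = pd.date_range(start="2022-01-01", end="2022-12-31", freq="D")
df = pd.DataFrame({"data": rng.random(len(dates))}, index=dates)

# Default: a value appears only once 7 observations are in the window
rolling_full = df["data"].rolling(window=7).mean()

# min_periods=1: average over however many observations exist so far
rolling_partial = df["data"].rolling(window=7, min_periods=1).mean()

print(rolling_full.isna().sum(), rolling_partial.isna().sum())
```

Keep in mind that the early values of the partial version average fewer than seven points, so they are noisier than the rest of the series.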

Pandas offers many more time series tools than the ones shown here, from simple summaries to the building blocks of statistical analysis.

To wrap up, this tutorial covered setting a DatetimeIndex, resampling to compute the sum, average, minimum, and maximum per day, month, and year, plotting the results, and smoothing data with a rolling average. These techniques cover most everyday summarization tasks on time series data.