Pandas: Saving a DataFrame in multiple CSV files (each file for each group)

Updated: February 20, 2024 By: Guest Contributor Post a comment

Introduction

Pandas, a powerful and widely used data manipulation library in Python, provides various functionalities for efficient data analysis. One common requirement during data processing is the need to split a large DataFrame into smaller groups and then save these groups into separate CSV files. This tutorial will guide you through different ways to accomplish this task, catering to various scenarios and complexity levels.

Basic Concept

Before diving into code, it’s crucial to understand the groupby method offered by Pandas. This method allows us to split the data into sets based on some criteria. Once grouped, Pandas enables us to apply functions to each group individually, such as aggregation, transformation, or even exporting groups to separate files.

Let’s start with a simple example. Assume we have a DataFrame called df containing sales data from a company. This data includes columns for the sale date, product name, and the amount sold.

import pandas as pd

df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'Product': ['WidgetA', 'WidgetB', 'WidgetA', 'WidgetC'],
    'Amount': [20, 30, 15, 35]
})

Exporting Groups to CSV

Suppose we want to group this data by ‘Product’ and save each group into a separate CSV file. The process involves two main steps:

  1. Grouping the DataFrame by the desired column using groupby.
  2. Iterating over each group and saving it to a CSV file.

Here’s how you can do it:

for name, group in df.groupby('Product'):
    group.to_csv(f'{name}.csv', index=False)

This loop goes through each group, with name being the name of the product and group the corresponding subset of the DataFrame. Each group is then saved to a CSV file named after the product with the .csv extension. The index=False parameter is used to prevent pandas from adding an unnamed index column to the file.

Advanced Usage

For more complex scenarios, where additional processing is required before saving the files or when dealing with a large number of groups, slightly more sophisticated techniques can be employed.

Applying Functions Before Saving

In situations where you might want to apply transformations or aggregations to each group before exporting, you can easily do so within the loop. For example, let’s say we want to add a new column to each group indicating the total sales before saving:

for name, group in df.groupby('Product'):
    group['TotalSales'] = group['Amount'].sum()
    group.to_csv(f'{name}_totals.csv', index=False)

This not only exports each group into a separate CSV but also enriches the data with additional information relevant to each product.

Handling Large Datasets

When working with large datasets, memory usage and performance become considerations. To handle such scenarios more efficiently, you can use the groupby method in combination with the chunksize parameter of the to_csv function. This allows exporting each group in chunks, reducing memory footprint:

for name, group in df.groupby('Product'):
    group.to_csv(f'{name}.csv', index=False, chunksize=1000)

Conclusion

Splitting a DataFrame into groups based on certain criteria and exporting each to a separate CSV file is a common task in data processing and analysis. This tutorial demonstrates basic to advanced techniques to achieve this using Pandas. Whether it’s a straightforward group and export operation or it involves additional data processing, Pandas provides the flexibility and efficiency required to handle the task. Applying these methods can substantially streamline your data workflow and contribute to more organized and accessible data.