Pandas: Concatenate multiple CSV files into a single DataFrame

Introduction
Prerequisites
Basic Concatenation
Concatenating Multiple Files
Advanced Concatenation with Same Structure but Different Column Names
Dealing with Different Structures
Conclusion

Introduction

Working with multiple CSV files is a common scenario in data analysis and data science projects. Often, these files are parts of a larger dataset that has been split due to its size, or they might come from different sources but have a similar structure that you want to analyze together. Pandas, a powerful data manipulation library in Python, provides an efficient way to concatenate these files into a single DataFrame for easier analysis and processing. This tutorial will guide you through various methods to achieve this, suitable for different use cases and complexity levels.

Prerequisites

Before we dive into the methods, ensure you have Pandas installed in your environment. You can install Pandas using pip:

pip install pandas

Basic Concatenation

To begin, let’s assume you have a simple case where you have two CSV files with the same structure: data1.csv and data2.csv. The goal is to concat these into one DataFrame.

import pandas as pd

data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')

df_combined = pd.concat([data1, data2])
print(df_combined.head())

This is the simplest form of concatenation. The pd.concat() function combines the DataFrames listed in its argument. The resulting DataFrame, df_combined, now contains data from both data1.csv and data2.csv.

Concatenating Multiple Files

When dealing with more than two files, manually adding each file to the list can be cumbersome. Let’s use a more automatic approach, ideal for situations where you have multiple CSV files in a directory:

import os
import pandas as pd

files = os.listdir('data_folder')
csv_files = [f for f in files if f.endswith('.csv')]
df_list = [pd.read_csv(f'data_folder/{file}') for file in csv_files]
df_combined = pd.concat(df_list, ignore_index=True)
print(df_combined.head())

This method first lists all files in a specified directory, filters out CSV files, reads each one into a separate DataFrame, and then combines them. The ignore_index=True parameter is crucial here as it reindexes the combined DataFrame to avoid index duplication.

Advanced Concatenation with Same Structure but Different Column Names

It’s not uncommon to encounter files with similar data but slightly different column names. For example, one file might have a column named “Date”, while another has “date”. Here’s how you can handle this:

import pandas as pd

def standardize_df(df):
    df.columns = df.columns.str.lower()
    return df

files = ['data1.csv', 'data2.csv', 'data3.csv']
df_list = []
for file in files:
    df = pd.read_csv(file)
    df = standardize_df(df)
    df_list.append(df)

df_combined = pd.concat(df_list)
print(df_combined.head())

In this advanced scenario, you define a function standardize_df to normalize column names to lowercase (or any other standard you wish to apply), ensuring that columns are correctly aligned when concatenated.

Dealing with Different Structures

Sometimes, the CSV files you’re working with might not have exactly the same structure. For example, they may have overlapping but not identical columns. Here’s how you can concatenate them while keeping all columns:

import pandas as pd

files = ['data1.csv', 'data2.csv', 'data3.csv']
df_combined = pd.DataFrame()
for file in files:
    df = pd.read_csv(file)
    df_combined = pd.concat([df_combined, df], sort=False)

print(df_combined.head())

This technique involves initializing an empty DataFrame and sequentially concatenating each file’s DataFrame into it, with sort=False to prevent Pandas from automatically sorting column names.

Conclusion

Concatenating multiple CSV files into a single DataFrame is a common task in data analysis. By using Pandas, you can handle files with the same or different structures, aligning or merging their contents as needed. This tutorial provided an overview of how to approach this task from basic to more advanced scenarios. Mastering these techniques can significantly streamline your data processing workflow.

Next Article: Pandas: Saving a DataFrame in multiple CSV files (each file for each group)

Previous Article: Pandas: Update a specific cell in DataFrame using index and column name

Series: DateFrames in Pandas

Pandas

How to Use Pandas for Geospatial Data Analysis (3 examples)

February 28, 2024