Pandas: How to append DataFrame rows to an existing CSV file

Overview
Getting Started
Basic: Appending DataFrame to an Existing CSV
Handling Different Column Orders
Checking for Duplicates Before Appending
Advanced: Using a Temporary DataFrame
Conclusion

Overview

Appending data to an existing CSV file is a common operation in data processing and analysis. It lets you update a dataset incrementally without reading and rewriting the entire file. In this tutorial, we will explore how to use Pandas, a powerful and flexible data manipulation library in Python, to append DataFrame rows to an existing CSV file.

Getting Started

Before diving into appending operations, let’s briefly review what a DataFrame is. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns) in Pandas. It can be thought of as a dict-like container for Series objects. Creating a DataFrame is simple and can be done in multiple ways, including from dictionaries, lists of tuples, or even CSV files.

import pandas as pd

data = {
  'Name': ['John', 'Anna', 'Peter', 'Linda'],
  'Age': [28, 34, 29, 32],
  'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

This will output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   32    London
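
The other construction routes mentioned above, a list of tuples and CSV data, work in much the same way. The following is a minimal sketch; the sample rows and the in-memory CSV text are made up for illustration, and io.StringIO stands in for a real file on disk.

```python
import io

import pandas as pd

# From a list of tuples: column names must be supplied explicitly.
rows = [('John', 28, 'New York'), ('Anna', 34, 'Paris')]
df_from_tuples = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])

# From CSV data: io.StringIO stands in for a file path here (illustrative only).
csv_text = "Name,Age,City\nPeter,29,Berlin\nLinda,32,London\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_tuples)
print(df_from_csv)
```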

Basic: Appending DataFrame to an Existing CSV

Now that we are familiar with creating a DataFrame, let’s look at how to append it to an existing CSV file. Assume we have a CSV file named data.csv containing similar information. The simplest way to append data from a DataFrame to a CSV file is by using the to_csv method with the mode argument set to 'a' (append) and header argument set to False to prevent writing the column headers again.

df.to_csv('data.csv', mode='a', header=False, index=False)

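This line of code adds the DataFrame rows to the end of ‘data.csv’ without writing the column headers or the index column. It’s a straightforward and efficient way to append data, because only the new rows are written. It assumes that ‘data.csv’ already contains a header row and ends with a newline; if the last line has no trailing newline, the first appended row is written onto that same line.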
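
If the target file may not exist yet, hard-coding header=False would produce a CSV with no header at all. One possible way around this, sketched below, is to write the header only when the file is missing or empty. The helper name append_to_csv and the temporary directory are assumptions made for this illustration, not part of Pandas.

```python
import os
import tempfile

import pandas as pd


def append_to_csv(df, path):
    """Hypothetical helper: append df to path, adding a header only if needed."""
    # Write the header when the file is missing or has no content yet.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode='a', header=write_header, index=False)


# Demonstration in a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'data.csv')
    first = pd.DataFrame({'Name': ['John'], 'Age': [28], 'City': ['New York']})
    second = pd.DataFrame({'Name': ['Anna'], 'Age': [34], 'City': ['Paris']})

    append_to_csv(first, csv_path)   # file does not exist yet: header is written
    append_to_csv(second, csv_path)  # file exists: only the data row is written

    result = pd.read_csv(csv_path)

print(result)
```

The check runs on every call, so the same helper can be used for both the first write and all later appends.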

Handling Different Column Orders

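When working with real-world datasets, it’s not uncommon to append data where the column order in the DataFrame doesn’t match the column order in the CSV file. Because header=False writes values purely by position, a mismatch silently places values under the wrong headers. To avoid this, reorder the DataFrame’s columns to match the target CSV before appending. The columns parameter of the to_csv method achieves the same result at write time. The example below assumes that ‘data.csv’ stores its columns in the order City, Name, Age.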

df = df[['City', 'Name', 'Age']]  # Reordering DataFrame columns
df.to_csv('data.csv', mode='a', header=False, index=False)

By reordering the DataFrame’s columns before appending, we make sure each value lands under the correct header in the ‘data.csv’ file.
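
Rather than typing the column order by hand, the order can be read from the file itself. The sketch below assumes the existing CSV has a header row; the file contents and the variable names (target_columns, new_rows) are invented for the demonstration.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'data.csv')

    # Existing file whose columns are stored in City, Name, Age order.
    pd.DataFrame(
        {'City': ['New York'], 'Name': ['John'], 'Age': [28]}
    ).to_csv(csv_path, index=False)

    # New rows arrive in a different column order.
    new_rows = pd.DataFrame({'Name': ['Anna'], 'Age': [34], 'City': ['Paris']})

    # nrows=0 parses only the header line, so no data rows are loaded.
    target_columns = pd.read_csv(csv_path, nrows=0).columns.tolist()

    # Selecting with the file's column list reorders the DataFrame to match;
    # a KeyError here means the DataFrame lacks a column the file expects.
    new_rows[target_columns].to_csv(
        csv_path, mode='a', header=False, index=False
    )

    combined = pd.read_csv(csv_path)

print(combined)
```

Reading the header first also acts as a cheap schema check, since a missing column fails loudly instead of corrupting the file.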

Checking for Duplicates Before Appending

In some cases, you may want to ensure that the data being appended does not contain duplicates. Pandas makes it easy to check and remove duplicates before the appending operation using the drop_duplicates method.

df.drop_duplicates(subset=['Name'], keep='last', inplace=True)
df.to_csv('data.csv', mode='a', header=False, index=False)

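This will remove any duplicate rows based on the ‘Name’ column before appending, keeping the last occurrence of each name. Note that drop_duplicates only looks at the DataFrame itself: rows that duplicate entries already stored in ‘data.csv’ are still appended, so this step alone does not guarantee that the file stays free of duplicates.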
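
To also skip rows whose key is already present in the file, the existing keys can be loaded and used as a filter before appending. This is a sketch under the assumption that ‘Name’ uniquely identifies a row; the sample data and the variable names (existing_names, to_append) are illustrative.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'data.csv')

    # Existing file with two rows.
    pd.DataFrame(
        {'Name': ['John', 'Anna'], 'Age': [28, 34], 'City': ['New York', 'Paris']}
    ).to_csv(csv_path, index=False)

    # New rows: Anna is already in the file, Peter appears twice.
    new_rows = pd.DataFrame({
        'Name': ['Anna', 'Peter', 'Peter'],
        'Age': [34, 29, 30],
        'City': ['Paris', 'Berlin', 'Berlin'],
    })

    # Step 1: remove duplicates inside the new rows, keeping the last one.
    new_rows = new_rows.drop_duplicates(subset=['Name'], keep='last')

    # Step 2: load only the key column of the existing file.
    existing_names = pd.read_csv(csv_path, usecols=['Name'])['Name']

    # Step 3: keep only rows whose Name is not in the file yet, then append.
    to_append = new_rows[~new_rows['Name'].isin(existing_names)]
    to_append.to_csv(csv_path, mode='a', header=False, index=False)

    final = pd.read_csv(csv_path)

print(final)
```

Loading just the key column with usecols keeps the memory cost lower than reading the full file.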

Advanced: Using a Temporary DataFrame

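When a plain append is not enough, for example because new rows must be checked against rows already in the file, you can read the existing CSV file into a temporary DataFrame, concatenate the new data, and then write the combined result back to the CSV. This is not a memory-saving technique: the whole file is loaded and rewritten, so it uses more memory and I/O than appending with mode='a' and is best reserved for files that fit comfortably in RAM.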

temp_df = pd.read_csv('data.csv')
new_data_df = pd.concat([temp_df, df], ignore_index=True)
new_data_df.to_csv('data.csv', mode='w', header=True, index=False)

This method involves reading the existing data into memory, but it provides a safer way to append data by ensuring that the entire dataset is correctly formatted and duplicates can be managed more effectively.
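
The duplicate handling mentioned above can be sketched as follows: concatenate, deduplicate across old and new rows, then rewrite the file. Writing to a scratch file and swapping it in with os.replace is an added precaution chosen for this illustration, so that an interrupted write does not leave a half-written CSV; the sample data and the scratch file name are assumptions.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'data.csv')

    # Existing file with two rows.
    pd.DataFrame(
        {'Name': ['John', 'Anna'], 'Age': [28, 34], 'City': ['New York', 'Paris']}
    ).to_csv(csv_path, index=False)

    # New rows: Anna is an updated record, Linda is new.
    new_rows = pd.DataFrame(
        {'Name': ['Anna', 'Linda'], 'Age': [35, 32], 'City': ['Paris', 'London']}
    )

    # Read everything, concatenate, and deduplicate across old and new rows.
    existing = pd.read_csv(csv_path)
    merged = pd.concat([existing, new_rows], ignore_index=True)
    merged = merged.drop_duplicates(subset=['Name'], keep='last')
    merged = merged.reset_index(drop=True)

    # Write to a scratch file first, then replace the original in one step.
    scratch_path = csv_path + '.tmp'
    merged.to_csv(scratch_path, index=False)
    os.replace(scratch_path, csv_path)

    rewritten = pd.read_csv(csv_path)

print(rewritten)
```

With keep='last', a new row replaces an older row that has the same ‘Name’, which turns the append into a simple upsert.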

Conclusion

Appending rows to an existing CSV file using Pandas is a straightforward task: to_csv with mode='a' writes only the new rows, which avoids rewriting the whole file. Whether dealing with simple append operations or managing more complex scenarios involving duplicates or differing column orders, Pandas offers a flexible set of tools to accommodate these requirements. Choosing between a direct append and a full read, concatenate, and rewrite cycle comes down to how much validation the data needs and how large the file is.
