Overview
Working with multiple datasets is a common scenario in data analysis and machine learning projects. Often, you’ll find yourself dealing with data spread across multiple Excel files that you need to combine into a single DataFrame for analysis. Fortunately, Pandas, a powerful data manipulation library in Python, makes this task much easier.
This tutorial will guide you through the process of combining multiple Excel files into a single DataFrame using Pandas, with examples ranging from basic to advanced. We’ll also explore some common issues you might encounter and how to solve them.
Prerequisites
- Python installed on your system.
- Pandas and openpyxl libraries installed. You can install these using pip:
pip install pandas openpyxl
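To confirm the installation before going further, a quick check along the following lines can help. This is only a sketch: the helper name missing_packages is made up for illustration, and it relies solely on the standard library's importlib.

```python
from importlib.util import find_spec


def missing_packages(names=("pandas", "openpyxl")):
    """Return the names from `names` that cannot be imported (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("pandas and openpyxl are importable.")
```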
Step-by-Step Instructions
Step 1: Basic Example
Let’s start with a basic example: combining two Excel files. Assume we have two files named ‘data1.xlsx’ and ‘data2.xlsx’ that share the same column layout.
import pandas as pd
# Load the first Excel file
data1 = pd.read_excel('data1.xlsx')
# Load the second Excel file
data2 = pd.read_excel('data2.xlsx')
# Stack data2 below data1; ignore_index=True renumbers the rows 0..n-1
combined_data = pd.concat([data1, data2], ignore_index=True)
print(combined_data)
This code loads the two Excel files into separate DataFrames and then combines them using the pd.concat() function. The result is a single DataFrame that contains the data from both files.
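One detail worth knowing is how pd.concat() treats row labels: by default each frame keeps its own index, so labels can repeat in the result. The sketch below uses two small in-memory DataFrames as stand-ins for the contents of the two files (the column names and values are invented for illustration) and compares the default behavior with ignore_index=True.

```python
import pandas as pd

# Stand-ins for what pd.read_excel would return for data1.xlsx and data2.xlsx
data1 = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
data2 = pd.DataFrame({"id": [3, 4], "value": [30, 40]})

# Default: each frame keeps its original row labels
kept = pd.concat([data1, data2])

# ignore_index=True: the result gets a fresh 0..n-1 index
renumbered = pd.concat([data1, data2], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Repeated labels are harmless for many tasks, but they can surprise you when selecting rows with .loc, so a fresh index is often the safer choice.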
Step 2: Combining Multiple Files
To combine multiple Excel files located in the same directory, you can use the glob module to match filenames and then load and combine them in a loop. Here’s how:
import pandas as pd
import glob
file_path = 'your_directory_path/*.xlsx'
files = glob.glob(file_path)
# Read every file first, then concatenate once
# (cheaper than growing a DataFrame inside the loop)
frames = []
for file in files:
    frames.append(pd.read_excel(file))
combined_data = pd.concat(frames, ignore_index=True)
print(combined_data)
This method is particularly useful when you have many files to combine. The glob module helps you match filenames using patterns (in this case, all files ending with ‘.xlsx’), making it easier to automate the file loading process.
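Two practical issues tend to show up with pattern matching. glob.glob() returns files in no guaranteed order, and Excel leaves temporary lock files such as ‘~$report.xlsx’ next to any workbook that is currently open; those match ‘*.xlsx’ but cannot be read as workbooks. A possible way to handle both, assuming a helper named list_excel_files (a name chosen here for illustration), is:

```python
import glob
import os


def list_excel_files(pattern):
    """Return matching paths in sorted order, skipping Excel lock files."""
    matches = sorted(glob.glob(pattern))
    # Lock files created by Excel start with "~$" and are not real workbooks
    files = [p for p in matches if not os.path.basename(p).startswith("~$")]
    if not files:
        # Fail early with a clear message instead of combining nothing
        raise FileNotFoundError(f"No Excel files match {pattern!r}")
    return files
```

You could then call list_excel_files('your_directory_path/*.xlsx') in place of glob.glob(file_path) and loop over the result as before.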
Step 3: Adding Filename as a Column
It is often helpful to know which file each row came from. You can record this by adding a column that stores the filename to each DataFrame before combining them.
import pandas as pd
import glob
file_path = 'your_directory_path/*.xlsx'
files = glob.glob(file_path)
frames = []
for file in files:
    df = pd.read_excel(file)
    # Record which file each row came from
    df['source_filename'] = file
    frames.append(df)
combined_data = pd.concat(frames, ignore_index=True)
print(combined_data)
This way, the resulting DataFrame holds not just the combined data but also the source of each row, improving traceability.
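Note that glob.glob() returns each path as matched, including the directory part. If you would rather store only the bare file name, something like the following could work. It is a sketch: tag_with_source and the file name sales_jan.xlsx are hypothetical, and an in-memory DataFrame stands in for the result of pd.read_excel().

```python
from pathlib import Path

import pandas as pd


def tag_with_source(df, path):
    """Return a copy of df with a source_filename column holding the bare file name."""
    return df.assign(source_filename=Path(path).name)


# Stand-in for a frame read from a hypothetical file
df = pd.DataFrame({"id": [1, 2]})
tagged = tag_with_source(df, "your_directory_path/sales_jan.xlsx")

print(tagged["source_filename"].tolist())  # ['sales_jan.xlsx', 'sales_jan.xlsx']
```

Because assign() returns a new DataFrame, the frame passed in is left unchanged.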
Step 4: Handling Different Data Structures
In real-world scenarios, Excel files might not always have the same structure (i.e., same columns). You can handle this situation by aligning the columns during the combination process. Here’s how to do it:
import pandas as pd
import glob
file_path = 'your_directory_path/*.xlsx'
files = glob.glob(file_path)
# Read each file once; the files may have different columns
frames = [pd.read_excel(file) for file in files]
# Build the union of column names, keeping first-seen order
all_columns = []
for df in frames:
    for column in df.columns:
        if column not in all_columns:
            all_columns.append(column)
# Give every frame the full column list (missing columns become NaN), then combine
aligned = [df.reindex(columns=all_columns) for df in frames]
column_align_df = pd.concat(aligned, ignore_index=True)
print(column_align_df)
This code reads each file once, collects the union of column names in first-seen order, reindexes every DataFrame to that full column list (columns a file lacks are filled with NaN), and then concatenates the aligned frames into a single DataFrame.
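It may also help to know that pd.concat() aligns frames on column names by itself. With the default join='outer' it keeps the union of columns and fills the gaps with NaN, while join='inner' keeps only the columns that every frame shares. The following sketch shows both on two small in-memory frames; the column names are invented for illustration.

```python
import pandas as pd

# Two frames with one shared column ("id") and one unique column each
a = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
b = pd.DataFrame({"id": [3, 4], "score": [0.5, 0.9]})

# Union of columns; values a frame does not have become NaN
outer = pd.concat([a, b], ignore_index=True)

# Only the columns present in every frame
inner = pd.concat([a, b], ignore_index=True, join="inner")

print(list(outer.columns))               # ['id', 'name', 'score']
print(list(inner.columns))               # ['id']
print(int(outer["score"].isna().sum()))  # 2
```

Building the column list explicitly is mainly useful when you want control over the final column order or want to inspect the columns before combining.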
Step 5: Merging Data with Different Identifiers
Sometimes you may want to merge data on specific columns, for example when files have different structures but share an identifier. Pandas provides several functions for this, such as merge() and join(). These functions are beyond the scope of this basic guide but are invaluable tools for more complex data integration tasks.
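For orientation only, here is a minimal sketch of a key-based merge. The two frames stand in for data loaded from two differently structured files; the column names customer_id, name, and amount are assumptions made up for this example.

```python
import pandas as pd

# Stand-ins for two files that share an identifier column
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20.0, 35.0, 12.5]})

# Left merge: keep every customer, attach matching orders where they exist
merged = customers.merge(orders, on="customer_id", how="left")

print(merged.shape)  # (4, 3)
```

Unlike pd.concat(), which stacks rows, merge() matches rows by key: the customer with two orders appears twice, and the customer with none gets NaN in the amount column.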
See also:
- Pandas: How to merge 2 DataFrames
- Pandas: How to ‘RIGHT JOIN’ 2 DataFrames (with examples)
- Pandas: How to ‘INNER JOIN’ 2 DataFrames
- Pandas: How to ‘LEFT JOIN’ 2 DataFrames
Conclusion
Combining multiple Excel files into a single DataFrame with Pandas is a powerful way to streamline your data analysis workflow. Whether you’re dealing with a few files or many, following the steps outlined in this tutorial will help you efficiently manage and analyze your data. Remember, the key is to understand the structure of your data and use the appropriate Pandas functions to combine it effectively.