Pandas: Reading CSV and Excel files from AWS S3 (4 examples)

Updated: February 25, 2024 By: Guest Contributor

Overview

Cloud storage services like AWS S3 have become a popular means of storing data files thanks to their reliability, scalability, and security. Pandas, a powerful data analysis and manipulation library for Python, makes it easy to load and work with data in many formats. Combining the two, you can efficiently manage and analyze data stored in the cloud. In this article, we walk through four examples of reading CSV and Excel files from AWS S3 with Pandas.

Prerequisites

  • Python installed on your machine.
  • Pandas and Boto3 libraries installed. You can install them using pip install pandas boto3.
  • An AWS account with access and secret keys.

Example 1: Basic CSV File Reading

First, ensure that your AWS credentials are set up correctly. You can do this by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or by configuring AWS CLI with aws configure.
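For instance, with those environment variables set, boto3 picks up credentials automatically, so no keys need to appear in your code. A minimal sketch (the placeholder values are for illustration only; in practice, export them in your shell rather than setting them in source):

import os
import boto3

# Illustrative placeholders -- in real use, export these in your shell
# instead of hardcoding them in a script.
os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_KEY'

s3 = boto3.client('s3')  # credentials are resolved from the environment

The example below passes the keys explicitly instead, which keeps it self-contained.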

import pandas as pd
import boto3
from io import StringIO

# Create an S3 client (explicit keys are shown for brevity; prefer
# environment variables or IAM roles in real code)
s3 = boto3.client('s3', aws_access_key_id='YOUR_ACCESS_KEY', aws_secret_access_key='YOUR_SECRET_KEY')

# Specify bucket and object key
bucket_name = 'your-bucket-name'
file_name = 'your-file-name.csv'

# Get the object
obj = s3.get_object(Bucket=bucket_name, Key=file_name)

# Read the data into a pandas DataFrame
df = pd.read_csv(StringIO(obj['Body'].read().decode('utf-8')))

print(df.head())
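Worth noting: if the optional s3fs package is installed (pip install s3fs), Pandas can read straight from an s3:// URL and handle the download itself, with no explicit boto3 call. A minimal sketch using the same placeholder bucket and key:

import pandas as pd

# Requires s3fs; credentials are resolved from the same sources boto3 uses.
df = pd.read_csv('s3://your-bucket-name/your-file-name.csv')
print(df.head())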

Example 2: CSV File with Custom Delimiter

Reading a CSV file with a non-standard delimiter is also straightforward with Pandas and S3. Let’s use a semicolon (;) as an example.

import pandas as pd
import boto3
from io import StringIO

s3 = boto3.client('s3')

bucket_name = 'your-bucket-name-again'
file_name = 'your-semicolon-delimited-file.csv'

obj = s3.get_object(Bucket=bucket_name, Key=file_name)

df = pd.read_csv(StringIO(obj['Body'].read().decode('utf-8')), delimiter=';')

print(df.head())
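Note that delimiter is an alias for the sep parameter. If the separator is not known up front, Pandas can infer it via Python's csv.Sniffer by passing sep=None together with the slower Python engine. A small self-contained sketch, with inline sample data standing in for the decoded S3 content above:

import pandas as pd
from io import StringIO

# Stand-in for the decoded S3 content from the example above.
csv_text = 'name;score\nalice;10\nbob;12\n'

# sep=None asks pandas to sniff the delimiter; this requires engine='python'.
df = pd.read_csv(StringIO(csv_text), sep=None, engine='python')
print(df.head())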

Example 3: Reading an Excel File

Reading Excel files is almost as easy as reading CSV files, but Pandas needs an extra engine installed: openpyxl for modern .xlsx files, or xlrd for legacy .xls files.

import pandas as pd
import boto3
from io import BytesIO

s3 = boto3.client('s3')

bucket_name = 'your-excel-file-bucket'
file_name = 'your-excel-file.xlsx'

obj = s3.get_object(Bucket=bucket_name, Key=file_name)

df = pd.read_excel(BytesIO(obj['Body'].read()), engine='openpyxl')

print(df.head())
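Workbooks often contain several sheets; read_excel's sheet_name parameter selects one by name (and sheet_name=None returns a dict of all sheets). A self-contained sketch that builds a tiny one-sheet workbook in memory to demonstrate, rather than fetching one from S3:

import pandas as pd
from io import BytesIO

# Build a small workbook in memory (requires openpyxl).
buf = BytesIO()
pd.DataFrame({'region': ['east', 'west'], 'sales': [100, 150]}).to_excel(
    buf, sheet_name='Sales', index=False)
buf.seek(0)

# Select the sheet by name; sheet_name=None would return every sheet.
df = pd.read_excel(buf, sheet_name='Sales', engine='openpyxl')
print(df.head())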

Example 4: Reading CSV Files in Chunks

When dealing with large files that might not fit into memory, Pandas can read a file in chunks. The S3 object's body is a file-like stream, so it can be handed to read_csv directly, and the chunksize parameter controls how many rows are returned at a time.

import pandas as pd
import boto3

s3 = boto3.client('s3')

bucket_name = 'your-large-file-bucket'
file_name = 'your-large-csv-file.csv'

obj = s3.get_object(Bucket=bucket_name, Key=file_name)

# Pass the streaming body straight to read_csv -- decoding the whole
# file into memory first would defeat the purpose of chunking.
chunk_size = 10000  # number of rows per chunk
for chunk in pd.read_csv(obj['Body'], chunksize=chunk_size):
    print(chunk.head())
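In practice, you would usually aggregate something across the chunks rather than just printing them. A sketch that counts rows (a fresh get_object call is made because the streaming body can only be consumed once):

import pandas as pd
import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='your-large-file-bucket', Key='your-large-csv-file.csv')

# Keep a running total of rows without ever holding the full file in memory.
total_rows = 0
for chunk in pd.read_csv(obj['Body'], chunksize=10000):
    total_rows += len(chunk)
print(f'{total_rows} rows')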

These examples showcase the basic methods to read data from AWS S3 into Pandas DataFrames, offering a solid foundation for further data analysis and manipulation. Whether handling CSV or Excel files, small or large datasets, the combination of Pandas and AWS S3 provides a robust solution for data scientists and developers.