Sling Academy
Home/Pandas/Pandas: Reading CSV and Excel files from AWS S3 (4 examples)

Pandas: Reading CSV and Excel files from AWS S3 (4 examples)

Last updated: February 25, 2024

Overview

Cloud storage services like AWS S3 have become a popular means for storing data files due to their reliability, scalability, and security. Pandas, a powerful data analysis and manipulation library for Python, allows developers to easily manipulate various data formats. Combining the two, you can efficiently manage and analyze data stored in the cloud. In this article, we will guide you through four examples of how to read CSV and Excel files from AWS S3 using Pandas.

Prerequisites

  • Python installed on your machine.
  • Pandas and Boto3 libraries installed. You can install them using pip install pandas boto3.
  • An AWS account with access and secret keys.

Example 1: Basic CSV File Reading

First, ensure that your AWS credentials are set up correctly. You can do this by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or by configuring AWS CLI with aws configure.

import pandas as pd
import boto3
from io import StringIO

# Create an S3 client
s3 = boto3.client('s3', aws_access_key_id='YOUR_ACCESS_KEY', aws_secret_access_key='YOUR_SECRET_KEY')

# Specify bucket and object key
bucket_name = 'your-bucket-name'
file_name = 'your-file-name.csv'

# Get the object
obj = s3.get_object(Bucket=bucket_name, Key=file_name)

# Read the data into a pandas DataFrame
df = pd.read_csv(StringIO(obj['Body'].read().decode('utf-8')))

print(df.head())

Example 2: CSV File with Custom Delimiter

Reading a CSV file with a non-standard delimiter is also straightforward with Pandas and S3. Let’s use a semicolon (;) as an example.

import pandas as pd
import boto3
from io import StringIO

s3 = boto3.client('s3')

bucket_name = 'your-bucket-name-again'
file_name = 'your-semicolon-delimited-file.csv'

obj = s3.get_object(Bucket=bucket_name, Key=file_name)

df = pd.read_csv(StringIO(obj['Body'].read().decode('utf-8')), delimiter=';')

print(df.head())

Example 3: Reading an Excel File

Reading Excel files is as easy as reading CSV files. However, you need to have openpyxl or xlrd installed depending on the Excel file’s version.

import pandas as pd
import boto3
from io import BytesIO

s3 = boto3.client('s3')

bucket_name = 'your-excel-file-bucket'
file_name = 'your-excel-file.xlsx'

obj = s3.get_object(Bucket=bucket_name, Key=file_name)

df = pd.read_excel(BytesIO(obj['Body'].read()), engine='openpyxl')

print(df.head())

Example 4: Reading CSV Files in Chunks

When dealing with large files that might not fit into memory, Pandas allows you to read the file in chunks. Here is how you can do it with a CSV file from S3.

import pandas as pd
import boto3
from io import StringIO

s3 = boto3.client('s3')

bucket_name = 'your-large-file-bucket'
file_name = 'your-large-csv-file.csv'

obj = s3.get_object(Bucket=bucket_name, Key=file_name)
csv_content = obj['Body'].read().decode('utf-8')

chunk_size = 10000 # Number of lines
for chunk in pd.read_csv(StringIO(csv_content), chunksize=chunk_size):
    print(chunk.head())

These examples showcase the basic methods to read data from AWS S3 into Pandas DataFrames, offering a solid foundation for further data analysis and manipulation. Whether handling CSV or Excel files, small or large datasets, the combination of Pandas and AWS S3 provides a robust solution for data scientists and developers.

Next Article: Pandas – Understanding DataFrame.eval() Method (with examples)

Previous Article: Pandas – Using DataFrame.cumsum() method (with examples)

Series: DateFrames in Pandas

Pandas

You May Also Like

  • How to Use Pandas Profiling for Data Analysis (4 examples)
  • How to Handle Large Datasets with Pandas and Dask (4 examples)
  • Pandas – Using DataFrame.pivot() method (3 examples)
  • Pandas: How to ‘FULL JOIN’ 2 DataFrames (3 examples)
  • Pandas: Select columns whose names start/end with a specific string (4 examples)
  • 3 ways to turn off future warnings in Pandas
  • How to Integrate Pandas with Apache Spark
  • How to Use Pandas for Web Scraping and Saving Data (2 examples)
  • How to Clean and Preprocess Text Data with Pandas (3 examples)
  • Pandas – Using Series.replace() method (3 examples)
  • Pandas json_normalize() function: Explained with examples
  • Using pandas.Series.rank() method (4 examples)
  • Pandas: Dropping columns whose names contain a specific string (4 examples)
  • Pandas: How to print a DataFrame without index (3 ways)
  • Fixing Pandas NameError: name ‘df’ is not defined
  • Pandas – Using DataFrame idxmax() and idxmin() methods (4 examples)
  • Pandas FutureWarning: ‘M’ is deprecated and will be removed in a future version, please use ‘ME’ instead
  • Pandas: Checking equality of 2 DataFrames (element-wise)
  • Understanding pandas.DataFrame.loc[] through 6 examples