Pandas DataFrame Cheat Sheet

Updated: March 1, 2023 By: Goodman Post a comment

This cheat sheet provides a comprehensive reference for working with Pandas DataFrames, including creating, viewing, selecting, manipulating, cleaning, working with dates, plotting, and exporting data.

Creating a DataFrame

import pandas as pd

# From a list of lists
data = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# From a dictionary of lists
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# From a CSV file
df = pd.read_csv('data.csv')

# From a SQL database
import sqlite3
conn = sqlite3.connect('mydatabase.db')
df = pd.read_sql_query("SELECT * FROM mytable", conn)

Viewing Data

# View the first n rows of the DataFrame
df.head(n)

# View the last n rows of the DataFrame
df.tail(n)

# View basic information about the DataFrame
df.info()

# View basic statistics about the DataFrame
df.describe()

# View the unique values in a column
df['Gender'].unique()

# View the number of occurrences of each value in a column
df['Gender'].value_counts()

# Sort the DataFrame by one or more columns
df.sort_values(['Age', 'Name'], ascending=[False, True])

Selecting Data

# Select a single column of data
df['Name']

# Select multiple columns of data
df[['Name', 'Age']]

# Select rows of data using boolean indexing
df[df['Age'] > 30]

# Select rows of data using loc (label-based) or iloc (integer-based) indexing
df.loc[1:3, ['Name', 'Age']]
df.iloc[1:3, [0, 1]]

# Select rows of data using query expressions
df.query('Age > 30 and Gender == "F"')

# Select the first occurrence of a value in a column
df['Name'].first_valid_index()

# Select the last occurrence of a value in a column
df['Name'].last_valid_index()

Manipulating Data

# Add a new column to the DataFrame
df['Gender'] = ['F', 'M', 'M']

# Rename columns of the DataFrame
df.rename(columns={'Name': 'Full Name', 'Age': 'Years'}, inplace=True)

# Drop columns or rows from the DataFrame
df.drop(columns=['Gender'], inplace=True)
df.drop([0, 1], inplace=True)

# Group data by a column and calculate summary statistics
df.groupby('Gender').agg({'Years': 'mean', 'Income': 'sum'})

# Apply a function to each element in a column of data
df['Age'].apply(lambda x: x + 1)

# Apply a function to each row or column of data
df.apply(lambda row: row['Age'] + row['Income'], axis=1)
df.apply(lambda col: col.mean() if col.dtype == 'float' else col)

# Merge two DataFrames by a common column
df1.merge(df2, on='id')

# Pivot a DataFrame to create a new table based on column values
df.pivot(index='Date', columns='Product', values='Sales')

Data Cleaning

# Rename columns in the DataFrame
df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Drop columns from the DataFrame
df.drop(columns=['col1', 'col2'], inplace=True)

# Drop rows from the DataFrame
df.drop(index=[0, 1], inplace=True)

# Replace missing values in the DataFrame with a specified value
df.fillna(0, inplace=True)

# Drop rows with missing values in the DataFrame
df.dropna(inplace=True)

# Remove duplicates from the DataFrame
df.drop_duplicates(inplace=True)

# Replace values in the DataFrame with a specified value
df.replace(to_replace='old_value', value='new_value', inplace=True)

# Apply a function to a column or columns in the DataFrame
df['col1'] = df['col1'].apply(function)
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(function)

# Convert a column to a different data type
df['col1'] = df['col1'].astype(int)

# Merge two DataFrames on a common column
df_merged = pd.merge(df1, df2, on='col1', how='inner')

Working With Dates

# Convert a column to a datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

# Extract the year, month, day, hour, minute, or second from a datetime column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Hour'] = df['Date'].dt.hour
df['Minute'] = df['Date'].dt.minute
df['Second'] = df['Date'].dt.second

# Set the index of the DataFrame to a datetime column
df.set_index('Date', inplace=True)

# Resample the DataFrame to a different time frequency
df.resample('D').sum()

Plotting Data

# Create a line plot of a column or columns
df.plot(x='Date', y='Sales')
df.plot(x='Date', y=['Sales', 'Expenses'])

# Create a bar plot of a column or columns
df.plot(x='Product', y='Sales', kind='bar')
df.plot(x='Product', y=['Sales', 'Expenses'], kind='bar')

# Create a scatter plot of two columns
df.plot(x='Sales', y='Expenses', kind='scatter')

# Create a box plot of a column or columns
df.plot(y=['Sales', 'Expenses'], kind='box')

Exporting Data

# Export the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

# Export the DataFrame to an Excel file
df.to_excel('data.xlsx', index=False)

# Export the DataFrame to a SQL database table
conn = sqlite3.connect('mydatabase.db')
df.to_sql('mytable', conn, if_exists='replace')

Afterword

If you are a veteran and have a lot of experience with Pandas, this cheat sheet will help you review and quickly look up the most core content. If you are new to Pandas, this cheat sheet will give you an overview of this amazing framework.

If you need more detailed instructions and more specific examples, continue reading the other articles in this series.