Overview
The sample() method in Pandas is a powerful tool for selecting random rows or columns from your DataFrame. This method provides a simple way to perform random sampling, which is vital in data analysis for making inferences or for generating data subsets. This tutorial covers five practical examples, ranging from basic use cases to more advanced applications of the sample() method.
Prerequisites
Before diving into the examples, ensure you have Pandas installed. If not, you can install it using pip:
pip install pandas
Example 1: Basic Random Sampling of Rows
Let’s start with the most straightforward application – sampling a fixed number of rows randomly from a DataFrame. It’s useful when you need a quick subset of your data:
import pandas as pd
import numpy as np
# Creating a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
# Sampling 2 random rows
sampled_df = df.sample(n=2)
print(sampled_df)
Each run prints 2 randomly selected rows from the DataFrame, so the output changes from run to run. Two example runs:
    Name  Age    City
1   Anna   34   Paris
3  Linda   32  London

    Name  Age    City
3  Linda   32  London
1   Anna   34   Paris
Example 2: Sampling With Replacement
Sometimes, you might want to sample with replacement, meaning the same row could be chosen more than once. This technique is especially useful in bootstrapping methods:
sampled_df = df.sample(n=2, replace=True)
print(sampled_df)
This code will allow rows to be selected more than once, potentially resulting in duplicate rows in the output.
Output (random):
    Name  Age    City
2  Peter   29  Berlin
2  Peter   29  Berlin
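To make the bootstrapping use case concrete, here is a minimal sketch that resamples the Age column with replacement and estimates a rough interval for its mean. The number of resamples (1,000), the per-resample seeds, and the percentile interval are illustrative assumptions, not part of the original example:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
})

# Draw 1,000 bootstrap resamples, each the same size as the original
# (frac=1.0 with replace=True), and record the mean age of each one.
# Using the loop counter as the seed is an assumption made for repeatability.
boot_means = pd.Series([
    df['Age'].sample(frac=1.0, replace=True, random_state=seed).mean()
    for seed in range(1000)
])

# The 2.5th and 97.5th percentiles give a rough 95% interval for the mean.
lower, upper = boot_means.quantile([0.025, 0.975])
print(f"Sample mean: {df['Age'].mean():.2f}")
print(f"Bootstrap 95% interval: {lower:.2f} to {upper:.2f}")
```

With only four rows the interval is very wide; the pattern is more informative on realistically sized data.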
Example 3: Sampling a Fraction of the DataFrame
Instead of specifying the exact number of rows, you might want to select a random fraction of your DataFrame. This is particularly handy when dealing with large datasets:
sampled_df = df.sample(frac=0.5)
print(sampled_df)
This command returns 50% of the rows from the DataFrame, which is 2 of the 4 rows here. The number of rows is fixed (frac multiplied by the row count, rounded to a whole number); only which rows are selected changes from run to run.
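To see how many rows frac yields in practice, a quick check along these lines may help. The seed range of 0 to 19 is an arbitrary choice for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, 29, 32]})

# Sample half of the 4 rows under 20 different seeds and collect
# the distinct row counts that come back.
row_counts = {len(df.sample(frac=0.5, random_state=seed)) for seed in range(20)}
print(row_counts)  # {2}
```

The set contains a single value because the row count depends only on frac and the length of the DataFrame, not on the random draw.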
Example 4: Random Sampling of Columns
The sample() method is not limited to rows; it can also be applied to columns. This is useful when you want to analyze or visualize a subset of data attributes:
sampled_df = df.sample(n=2, axis=1)
print(sampled_df)
In this case, we specify axis=1 to indicate column-wise sampling, resulting in 2 randomly selected columns.
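As a small sketch of what column-wise sampling returns, the snippet below rebuilds the same DataFrame and inspects the result. The seed value 0 is an assumption added only so that repeated runs pick the same columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# axis='columns' is equivalent to axis=1.
cols_df = df.sample(n=2, axis='columns', random_state=0)

print(list(cols_df.columns))  # two of 'Name', 'Age', 'City'
print(cols_df.shape)          # (4, 2): every row is kept, only columns are sampled
```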
Example 5: Combining a Random Seed and Weights
For more control over the sampling process, you can use a random seed to ensure repeatability and assign weights to make certain rows more likely to be chosen:
weights = [0.1, 0.2, 0.3, 0.4] # Higher weight means higher chance of being selected
sampled_df = df.sample(n=2, weights=weights, random_state=42)
print(sampled_df)
Setting a random seed (random_state=42) makes the selection repeatable across runs. Adjusting weights influences each row’s likelihood of being selected, offering a way to prioritize certain data points.
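Both effects can be checked with a short sketch. The second half draws 10,000 rows with replacement purely to make the weights visible as frequencies; that draw size and its seed are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
})
weights = [0.1, 0.2, 0.3, 0.4]

# Same seed and same weights: the two draws should be identical.
first = df.sample(n=2, weights=weights, random_state=42)
second = df.sample(n=2, weights=weights, random_state=42)
print(first.equals(second))

# Empirical check of the weights: sample many rows with replacement
# and look at how often each name appears.
draws = df.sample(n=10_000, replace=True, weights=weights, random_state=0)
freqs = draws['Name'].value_counts(normalize=True)
print(freqs.round(2))
```

The observed frequencies should land close to the weights, with Linda drawn most often and John least often. Weights that do not sum to 1 are normalized by pandas, and when sampling rows the weights argument can also be the name of a column, for example weights='Age'.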
Conclusion
The sample() method in Pandas is a versatile tool for random sampling, enabling a broad array of data analysis tasks. Starting with basic random row sampling and progressing to more complex scenarios like weighted sampling with a fixed seed, this method significantly enhances the flexibility and power of data manipulation in Pandas. By incorporating these techniques into your workflow, you can more effectively analyze and understand your datasets.