Using DataFrame.sample() method in Pandas (5 examples)

Overview
Example 1: Basic Random Sampling of Rows
Example 2: Sampling With Replacement
Example 3: Sampling a Fraction of the DataFrame
Example 4: Random Sampling of Columns
Example 5: Combining Random Seed and Weights
Conclusion

Overview

The sample() method in Pandas returns a random selection of rows or columns from a DataFrame. Random sampling is useful in data analysis for drawing inferences from a subset or for producing smaller working datasets. This tutorial covers five practical examples, from basic row sampling to weighted, reproducible sampling.

Prerequisites

Before diving into the examples, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas

Example 1: Basic Random Sampling of Rows

Let’s start with the most straightforward application – sampling a fixed number of rows randomly from a DataFrame. It’s useful when you need a quick subset of your data:

import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)

# Sampling 2 random rows
sampled_df = df.sample(n=2)
print(sampled_df)

The output consists of 2 randomly selected rows from our DataFrame. Both the rows chosen and their order can differ each time you run the command. Two possible outputs:

    Name  Age    City
1   Anna   34   Paris
3  Linda   32  London

    Name  Age    City
3  Linda   32  London
1   Anna   34   Paris
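One limit is worth checking here: by default sampling is done without replacement, so n cannot exceed the number of rows. The sketch below rebuilds the same four-row DataFrame so that it runs on its own; the variable names raised and bigger are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})

# Asking for 5 rows from a 4-row DataFrame fails when replace=False (the default)
try:
    df.sample(n=5)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# The same request works once rows may be drawn more than once
bigger = df.sample(n=5, replace=True)
print(len(bigger))  # 5
```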

Example 2: Sampling With Replacement

Sometimes, you might want to sample with replacement, meaning the same row could be chosen more than once. This technique is especially useful in bootstrapping methods:

sampled_df = df.sample(n=2, replace=True)
print(sampled_df)

This code will allow rows to be selected more than once, potentially resulting in duplicate rows in the output.

Output (random):

    Name  Age    City
2  Peter   29  Berlin
2  Peter   29  Berlin
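To illustrate the bootstrapping use case, here is a minimal sketch that resamples the Age column with replacement and looks at the spread of the resampled means. The number of resamples (1000), the use of the loop counter as the seed, and the 2.5%/97.5% quantiles are assumptions made for this illustration, not part of the sample() API.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})

# Each bootstrap resample has the same size as the original data (frac=1)
# and is drawn with replacement, so some ages repeat and others are left out.
boot_means = pd.Series([
    df['Age'].sample(frac=1, replace=True, random_state=seed).mean()
    for seed in range(1000)
])

# Centre and rough 95% interval of the resampled means
print(boot_means.mean())
print(boot_means.quantile([0.025, 0.975]))
```

With only four rows the interval is wide; the pattern is the same for larger datasets.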

Example 3: Sampling a Fraction of the DataFrame

Instead of specifying the exact number of rows, you might want to select a random fraction of your DataFrame. This is particularly handy when dealing with large datasets:

sampled_df = df.sample(frac=0.5)
print(sampled_df)

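This command returns 50% of the rows from the DataFrame. The number of rows is not random: Pandas multiplies frac by the number of rows and rounds the result to a whole number, so this four-row DataFrame always yields exactly 2 rows. Only which rows are selected varies from run to run.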
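The row count produced by frac can be checked directly. The sketch below rebuilds the four-row DataFrame from Example 1 so that it runs on its own; the second call assumes a reasonably recent Pandas release, since fractions above 1 are only accepted together with replace=True.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})

# 0.5 * 4 rows = 2 rows; which two rows come back is random
half = df.sample(frac=0.5)
print(len(half))  # 2

# A fraction above 1 upsamples and therefore needs replace=True
oversampled = df.sample(frac=1.5, replace=True)
print(len(oversampled))  # 6
```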

Example 4: Random Sampling of Columns

The sample() method is not limited to rows; it can also be applied to columns. This is useful when you want to analyze or visualize a subset of data attributes:

sampled_df = df.sample(n=2, axis=1)
print(sampled_df)

In this case, we specify axis=1 to indicate column-wise sampling, resulting in 2 randomly selected columns.

Example 5: Combining Random Seed and Weights

For more control over the sampling process, you can use a random seed to ensure repeatability and assign weights to make certain rows more likely to be chosen:

weights = [0.1, 0.2, 0.3, 0.4]  # Higher weight means higher chance of being selected
sampled_df = df.sample(n=2, weights=weights, random_state=42)
print(sampled_df)

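By setting a random seed (random_state=42), the selection becomes repeatable: the same call returns the same rows on every run. The weights set each row’s relative likelihood of being selected, offering a way to prioritize certain data points. When passed as a list, weights must contain one value per row, and Pandas normalizes them if they do not sum to 1.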
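Both effects can be verified with a short check. This is a sketch under stated assumptions: the DataFrame is rebuilt so the block is self-contained, and the 10,000 draws with replacement and the seed 0 are arbitrary choices used only to make the weight proportions visible.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})
weights = [0.1, 0.2, 0.3, 0.4]

# Same seed and same arguments give the same rows in the same order
first = df.sample(n=2, weights=weights, random_state=42)
second = df.sample(n=2, weights=weights, random_state=42)
print(first.equals(second))  # True

# Drawing many rows with replacement shows the weights at work:
# each name's share of the draws should sit close to its weight
draws = df.sample(n=10_000, replace=True, weights=weights, random_state=0)
freq = draws['Name'].value_counts(normalize=True)
print(freq)
```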

Conclusion

The sample() method in Pandas covers most random sampling needs in data analysis: a fixed number of rows, a fraction of the DataFrame, sampling with replacement, column sampling, and weighted sampling with a fixed seed for repeatable results. Incorporating these techniques into your workflow makes it easier to explore large datasets through representative subsets.
