Overview
Data science and machine learning workflows often require shuffling a dataset so that models are not biased by the order in which the data is presented. In Python, Pandas is a powerful tool for data manipulation, and shuffling the rows of a DataFrame is a common operation. This tutorial walks through several ways to shuffle rows in a Pandas DataFrame, from basic to more advanced techniques.
A Quick Note about Pandas DataFrame
Before delving into shuffling rows, let’s briefly understand what a Pandas DataFrame is. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It’s akin to a spreadsheet or SQL table.
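As a small illustration of those properties, the sketch below builds a DataFrame with columns of different types and inspects its shape and labels. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical data: three columns with different dtypes
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'age': [31, 45, 27],
    'score': [88.5, 92.0, 79.5],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column labels
print(list(df.index))    # row labels (a default RangeIndex here)

# Size-mutable: a new column can be added to the existing DataFrame
df['passed'] = df['score'] >= 80
print(df.shape)
```

The row labels printed here are the index that the shuffling methods below either keep or reset.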
Basic Shuffling with sample
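The most straightforward way to shuffle the rows of a DataFrame is the sample method, which randomly samples rows from the DataFrame. Setting the frac parameter to 1 returns every row exactly once, in random order. Because each row keeps its original index label, chaining reset_index(drop=True) discards those labels and renumbers the rows from 0.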
import pandas as pd
import numpy as np
# Creating a sample DataFrame
df = pd.DataFrame({
    'A': np.arange(10),
    'B': np.random.rand(10)
})
# Shuffling the rows
df_shuffled = df.sample(frac=1).reset_index(drop=True)
# Display the shuffled DataFrame
print(df_shuffled)
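To see what reset_index(drop=True) contributes, the following sketch compares the index before and after resetting, and checks that shuffling only reorders rows. The seed passed to random_state is an arbitrary choice made so the example is repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.arange(10),
    'B': np.random.rand(10),
})

# sample(frac=1) returns every row once, in random order,
# and each row keeps its original index label
shuffled = df.sample(frac=1, random_state=0)
print(shuffled.index.tolist())

# reset_index(drop=True) discards those labels and renumbers from 0
renumbered = shuffled.reset_index(drop=True)
print(renumbered.index.tolist())

# The shuffle is a permutation: sorting by the old labels restores the order
restored = shuffled.sort_index()
print(restored['A'].tolist())
```

Skipping reset_index is reasonable when the original labels carry meaning, for example when rows need to be traced back to their source.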
Shuffling with a Random Seed
To ensure reproducibility, you can shuffle rows with a specific random seed. This is particularly useful in experiments where the same random order of rows is required across different runs.
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
    'A': np.arange(10),
    'B': np.random.rand(10)
})
# Shuffling rows with a specific seed
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
# Display the shuffled DataFrame
print(df_shuffled)
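A quick way to confirm the reproducibility claim is to shuffle twice with the same seed and compare the results. The seed values 42 and 7 are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.arange(10),
    'B': np.random.rand(10),
})

first = df.sample(frac=1, random_state=42)
second = df.sample(frac=1, random_state=42)
other = df.sample(frac=1, random_state=7)

# Same seed, same row order
print(first.index.equals(second.index))

# A different seed will usually, though not necessarily, give a different order
print(first.index.equals(other.index))
```

Note that the seed fixes only the row order. Column B is still generated by np.random.rand without a seed, so its values differ between runs of the script.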
Using NumPy to Shuffle
Another way to shuffle the DataFrame's rows is with NumPy's random.shuffle function, which shuffles an array in place. Here we shuffle an array of row positions, assign it to the DataFrame as its index, and then sort by that index, which rearranges the rows into a random order.
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({
    'A': np.arange(10),
    'B': np.random.rand(10)
})
# Shuffling the index using NumPy
shuffled_index = np.arange(df.shape[0])
np.random.shuffle(shuffled_index)
# Assigning the shuffled positions as the index, then sorting by it to reorder the rows
df = df.set_index([shuffled_index]).sort_index()
# Displaying the shuffled DataFrame
print(df)
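A more direct variant, sketched below, draws a random permutation of row positions and selects rows with iloc. It uses NumPy's Generator API (np.random.default_rng) rather than the global random state used above; that substitution, and the seed of 42, are choices made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.arange(10),
    'B': np.random.rand(10),
})

# A seeded generator keeps the shuffle repeatable without touching global state
rng = np.random.default_rng(42)

# permutation(n) returns a shuffled copy of 0..n-1
positions = rng.permutation(len(df))

# iloc selects rows by position, in the order given
df_shuffled = df.iloc[positions].reset_index(drop=True)
print(df_shuffled)
```

Because iloc works on positions, this approach does not depend on what the DataFrame's index labels happen to be.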
Shuffling Rows with sklearn
For those working alongside scikit-learn in their projects, there's an alternative method to shuffle DataFrame rows using the shuffle function from sklearn.utils.
from sklearn.utils import shuffle
import pandas as pd
import numpy as np
# Creating a sample DataFrame
df = pd.DataFrame({
    'A': np.arange(10),
    'B': np.random.rand(10)
})
# Shuffling the DataFrame rows using sklearn
shuffled_df = shuffle(df, random_state=42)
# Display the shuffled DataFrame
print(shuffled_df)
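sklearn.utils.shuffle also accepts several arrays at once and applies the same permutation to each, which helps when features and labels live in separate objects. In this sketch, X and y are hypothetical names for a feature table and a label column.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical features and labels, kept in separate objects
X = pd.DataFrame({'A': np.arange(10), 'B': np.random.rand(10)})
y = pd.Series(np.arange(10) % 2, name='label')

# Both inputs are shuffled with the same permutation
X_shuffled, y_shuffled = shuffle(X, y, random_state=42)

# Rows still line up: the index labels match position by position
print(X_shuffled.index.equals(y_shuffled.index))
```

The shuffled objects keep their original index labels; chain reset_index(drop=True) on each if a fresh 0-based index is wanted.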
Advanced: Stratified Shuffling
When the proportions of a categorical variable must be preserved across train and test splits, stratified shuffling is useful. Pandas has no dedicated splitter for this, but scikit-learn provides StratifiedShuffleSplit in sklearn.model_selection.
from sklearn.model_selection import StratifiedShuffleSplit
import pandas as pd
# Sample DataFrame with a categorical variable
df = pd.DataFrame({
    'A': np.arange(10),
    'B': ['cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog']
})
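# Marking every row as train (0) by default
df['split'] = 0
# Setting up StratifiedShuffleSplit
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
# split() yields row positions; they match the labels here because df has the default index
for train_index, test_index in splitter.split(df, df['B']):
    df.loc[test_index, 'split'] = 1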
# Separating the DataFrame into train and test sets based on the split marker
train_df = df[df['split'] == 0].drop('split', axis=1)
test_df = df[df['split'] == 1].drop('split', axis=1)
# Display the train and test sets
print("Train Set:")
print(train_df)
print("\nTest Set:")
print(test_df)
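To check that a stratified split behaves as intended, one can compare class proportions across the two sets. The sketch below also selects rows with iloc, since split yields integer positions rather than index labels; this matters once the DataFrame no longer has the default 0-based index. The letter labels used as the index here are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Same data as above, but with a non-default (hypothetical) index
df = pd.DataFrame({
    'A': np.arange(10),
    'B': ['cat', 'dog'] * 5,
}, index=list('abcdefghij'))

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)

# split() yields arrays of row positions, so iloc is the matching accessor
train_pos, test_pos = next(splitter.split(df, df['B']))
train_df = df.iloc[train_pos]
test_df = df.iloc[test_pos]

# Each set should keep roughly the original 50/50 cat/dog ratio
print(train_df['B'].value_counts(normalize=True))
print(test_df['B'].value_counts(normalize=True))
```

With only five rows per set the ratio cannot be exactly even, so one class will appear one more time than the other in each set.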
Conclusion
Shuffling rows in a Pandas DataFrame can be achieved through various methods, each suitable for different scenarios. Whether you need a simple row shuffle or a more controlled, stratified shuffling for machine learning model validation, Pandas in conjunction with libraries like NumPy and sklearn offers flexible solutions. Understanding and applying these techniques will enhance your data preprocessing skills in Python.