Using Pandas from_dummies() function (4 examples)

Updated: February 21, 2024 By: Guest Contributor

Introduction

Pandas is a powerful library for data manipulation and analysis in Python. One of its lesser-known functions is from_dummies(), available since pandas 1.5, which reverses one-hot encoding: it takes a DataFrame of dummy (0/1) columns, such as the output of get_dummies(), and converts it back to its original categorical form. In this tutorial, we will explore from_dummies() through four examples of gradually increasing complexity. Whether you’re preprocessing data for machine learning or simply managing datasets, knowing how to use from_dummies() saves you from writing the decoding logic by hand.

Let’s dive directly into the examples.

Example 1: Basic Use Case

Let’s start with a basic example to understand how the from_dummies() function works. Suppose we have a DataFrame df that resulted from one-hot encoding a categorical column. It looks something like this:

import pandas as pd

# Sample DataFrame from one-hot encoding
data = {'Animal_dog': [1, 0], 'Animal_cat': [0, 1], 'Animal_bird': [0, 0]}
df = pd.DataFrame(data)

# Using from_dummies() to revert one-hot encoding; sep is the separator
# between the prefix ('Animal') and the category ('dog') in the column names
df_original = pd.from_dummies(df, sep='_')
print(df_original)

Output:

  Animal
0    dog
1    cat

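In this simple scenario, from_dummies() splits each column name at the separator, uses the prefix (Animal) as the name of the restored column, and uses the part after the separator as the category value. The Animal_bird column is all zeros, so bird never appears in the output.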
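To check that from_dummies() really undoes get_dummies(), the sketch below encodes a small column and decodes it again. The column name Animal and its values are made up for illustration, and the example assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical categorical data, made up for this illustration
original = pd.DataFrame({'Animal': ['dog', 'cat', 'bird', 'dog']})

# Encode: get_dummies() names the columns <prefix>_<category> by default
encoded = pd.get_dummies(original)

# Decode: from_dummies() splits each column name at sep to recover
# the prefix (the restored column name) and the category (the value)
restored = pd.from_dummies(encoded, sep='_')

print(sorted(encoded.columns))
print(restored['Animal'].tolist())
```

Comparing the values with tolist() rather than DataFrame.equals() keeps the check independent of dtype differences between pandas versions.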

Example 2: Handling Missing Categories

Sometimes a row has no category set at all, so every dummy column in it is 0. This typically happens when one category was dropped during encoding, for example with get_dummies(drop_first=True). By default, from_dummies() raises a ValueError for such rows; the default_category parameter tells it which category an all-zero row stands for:

import pandas as pd

# The third row is all zeros: its category ('bird') has no dummy column
data = {'Animal_dog': [1, 0, 0], 'Animal_cat': [0, 1, 0]}
df = pd.DataFrame(data)

# Reverting one-hot encoding with a missing category
# default_category names the category implied by an all-zero row
df_original = pd.from_dummies(df, sep='_', default_category='bird')
print(df_original)

Output:

  Animal
0    dog
1    cat
2   bird

This example demonstrates the ability of from_dummies() to restore the original data even when one category has no dummy column of its own. Without default_category, the all-zero third row would raise a ValueError rather than being filled with NaN.
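The sketch below shows where all-zero rows usually come from and what happens with and without default_category. It assumes pandas 1.5 or later; the data is hypothetical, and the exact wording of the error message may differ between versions.

```python
import pandas as pd

# Hypothetical data for illustration
original = pd.DataFrame({'Animal': ['dog', 'cat', 'bird']})

# drop_first=True drops the first category in sorted order ('bird'),
# so the 'bird' row is encoded as all zeros
encoded = pd.get_dummies(original, drop_first=True)

# Without default_category, the all-zero row cannot be decoded
try:
    pd.from_dummies(encoded, sep='_')
    failed = False
except ValueError as exc:
    failed = True
    print(f"ValueError: {exc}")

# Naming the dropped category makes the round trip possible
restored = pd.from_dummies(encoded, sep='_', default_category='bird')
print(restored['Animal'].tolist())
```

The caller has to remember which category was dropped, because the encoded columns no longer carry that information.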

Example 3: Advanced Data Transformation

Moving to a more complex example, let’s see how from_dummies() can be leveraged for transforming a DataFrame with multiple one-hot encoded columns. Consider a dataset where we not only have animals but also colors as categories. The task is to merge these back into their original form:

import pandas as pd

# Advanced data transformation
# Columns for two different categorical sets: animals and colors
data = {
    'Animal_dog': [1, 0, 0],
    'Animal_cat': [0, 1, 0],
    'Animal_bird': [0, 0, 1],
    'Color_red': [0, 0, 1],
    'Color_green': [1, 0, 0],
    'Color_blue': [0, 1, 0]
}
df = pd.DataFrame(data)

# Columns are grouped by the prefix before sep, so no column list is needed
df_original = pd.from_dummies(df, sep='_')
print(df_original)

Output:

  Animal   Color
0    dog  green
1    cat   blue
2   bird    red

In this example, from_dummies() groups the dummy columns by the prefix that precedes the separator: every Animal_* column is decoded into the Animal column and every Color_* column into the Color column. Each row must have exactly one dummy set per prefix, otherwise a ValueError is raised unless default_category is supplied.
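When several prefixes are involved, default_category can also be a dict that maps each prefix to its own fallback value. The following is a minimal sketch with hypothetical data and made-up fallback labels ('unknown', 'none'); in current pandas versions the dict appears to need one entry per prefix.

```python
import pandas as pd

# Hypothetical dummies: row 2 has no animal set, row 1 has no color set
df = pd.DataFrame({
    'Animal_dog': [1, 0, 0],
    'Animal_cat': [0, 1, 0],
    'Color_red': [0, 0, 1],
    'Color_green': [1, 0, 0],
})

# One fallback per prefix (assumed: every prefix needs an entry)
decoded = pd.from_dummies(
    df,
    sep='_',
    default_category={'Animal': 'unknown', 'Color': 'none'},
)
print(decoded)
```

Passing a single value instead, such as default_category='unknown', applies the same fallback to every prefix.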

Example 4: Custom Prefix Separator

Last but not least, it’s important to understand how the from_dummies() function deals with different prefix separators. If your data uses a custom prefix separator, here’s how you adapt:

import pandas as pd

# Sample data with a custom prefix separator
data = {'animal#dog': [1, 0, 0], 'animal#cat': [0, 1, 0], 'animal#bird': [0, 0, 1], 'color#red': [0, 0, 1], 'color#green': [1, 0, 0], 'color#blue': [0, 1, 0]}
df = pd.DataFrame(data)

# Custom prefix separator
df_original = pd.from_dummies(df, sep='#')
print(df_original)

Output:

  animal   color
0    dog  green
1    cat   blue
2   bird    red

This showcases the flexibility of from_dummies() in accommodating different prefix separators: as long as sep matches the separator used in the column names, reverting to the original format is straightforward.
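One detail that is easy to trip over is that the two functions name the separator differently: get_dummies() takes prefix_sep, while from_dummies() takes sep. The sketch below, using hypothetical data and assuming pandas 1.5 or later, performs a round trip with a custom separator.

```python
import pandas as pd

# Hypothetical data for illustration
original = pd.DataFrame({
    'animal': ['dog', 'cat', 'bird'],
    'color': ['green', 'blue', 'red'],
})

# get_dummies() calls the separator prefix_sep ...
encoded = pd.get_dummies(original, prefix_sep='#')

# ... while from_dummies() calls it sep
restored = pd.from_dummies(encoded, sep='#')

print(list(encoded.columns))
print(restored)
```

Choosing a separator that does not occur in the column names or category values, as '#' does here, avoids ambiguity about where the prefix ends.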

Conclusion

The from_dummies() function in Pandas provides a reliable way to revert one-hot encoded data back to its original categorical state. Two parameters cover most cases: sep identifies the separator between the prefix and the category in the column names, and default_category names the category implied by rows in which no dummy is set. Understanding these two options lets you undo get_dummies() in a single call and streamline your data preprocessing workflows.