Understanding Pandas get_dummies() function (5 examples)

Updated: February 21, 2024 By: Guest Contributor

Introduction

The pandas.get_dummies() function is an essential tool in the data scientist’s toolkit, especially when dealing with categorical data. It allows the conversion of categorical variable(s) into dummy/indicator variables, which is a critical step in preparing data for machine learning models. This tutorial will walk you through understanding and utilizing this function with five practical examples, gradually increasing in complexity.

Basic Usage of get_dummies()

Let’s start with the basics. The simplest form of pandas.get_dummies() involves converting a single categorical column into dummy variables. Consider a dataset df with a column 'Color' having three categories: 'Red', 'Green', and 'Blue'.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']
})

# Applying get_dummies(); dtype=int yields 0/1 columns (pandas 2.0+ defaults to True/False)
dummy_df = pd.get_dummies(df['Color'], dtype=int)

print(dummy_df)

The output will look something like this:

   Blue  Green  Red
0     0      0    1
1     0      1    0
2     1      0    0
3     0      1    0
4     0      0    1

This table shows the binary representation of different colors, turning the categorical variable into a format that can easily be processed by algorithms.
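As a quick sanity check on this encoding, the sketch below confirms that every row has exactly one indicator set and that the column totals match the category counts. The variable names are illustrative, and dtype=int is passed explicitly because the default output dtype depends on the pandas version (boolean from 2.0 onward).

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})

# Request integer 0/1 columns explicitly instead of relying on the default dtype
dummy_df = pd.get_dummies(df['Color'], dtype=int)

# One column per category, in sorted order
print(list(dummy_df.columns))

# Each row belongs to exactly one category, so every row sums to 1
print(dummy_df.sum(axis=1).tolist())

# Column sums give the number of rows in each category
print(dummy_df.sum().to_dict())
```

If a row ever sums to 0 here, the original value was missing; a sum above 1 cannot occur for a single encoded column.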

Multi-column Conversion and Prefixing

In this example, we extend the application of get_dummies() to multiple columns and add prefixes to clarify the origin of the dummy variables. Assume df now includes an additional categorical variable, 'Size'.

df['Size'] = ['S', 'M', 'L', 'M', 'S']

# Applying get_dummies() to multiple columns and adding prefixes (dtype=int gives 0/1 output)
dummies = pd.get_dummies(df, columns=['Color', 'Size'], prefix=['color', 'size'], dtype=int)

print(dummies)

The resulting DataFrame includes separate dummy variables for both ‘Color’ and ‘Size’, with added prefixes for better understanding:

   color_Blue  color_Green  color_Red  size_L  size_M  size_S
0           0            0          1       0       0       1
1           0            1          0       0       1       0
2           1            0          0       1       0       0
3           0            1          0       0       1       0
4           0            0          1       0       0       1
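The prefix argument also accepts a dict that maps each encoded column to its prefix, which avoids relying on list order. The sketch below additionally assumes a hypothetical numeric column, 'Weight', to show that columns not listed in columns pass through unchanged and are placed before the dummy columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'M', 'S'],
    'Weight': [1.2, 0.8, 2.5, 0.9, 1.1],  # hypothetical numeric column, for illustration only
})

# A dict ties each prefix to its source column explicitly
dummies = pd.get_dummies(
    df,
    columns=['Color', 'Size'],
    prefix={'Color': 'color', 'Size': 'size'},
    dtype=int,
)

# 'Weight' is left untouched and appears before the dummy columns
print(list(dummies.columns))
```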

Handling Missing Values

Handling missing values is an essential part of data preprocessing. By default, get_dummies() ignores missing values: a row whose category is missing simply gets 0 in every dummy column for that variable. However, sometimes it’s useful to keep a column for missing values as well. This can be achieved using the dummy_na parameter.

df['Size'] = ['S', 'M', 'L', None, 'S']  # None marks a missing value

# Handling missing values
pd.get_dummies(df, columns=['Color', 'Size'], prefix=['color', 'size'], dummy_na=True, dtype=int)

With dummy_na=True, get_dummies() adds a NaN indicator column for every encoded variable. Here size_nan flags the missing value in row 3, while color_nan is all zeros because ‘Color’ has no missing values:

   color_Blue  color_Green  color_Red  color_nan  size_L  size_M  size_S  size_nan
0           0            0          1          0       0       0       1         0
1           0            1          0          0       0       1       0         0
2           1            0          0          0       1       0       0         0
3           0            1          0          0       0       0       0         1
4           0            0          1          0       0       0       1         0
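To make the difference concrete, here is a minimal sketch that encodes a single Series with and without dummy_na. The Series is an illustrative stand-in for the ‘Size’ column, with np.nan marking the missing entry.

```python
import numpy as np
import pandas as pd

size = pd.Series(['S', 'M', 'L', np.nan, 'S'], name='Size')

# Default behaviour: the missing entry (row 3) gets 0 in every dummy column
default = pd.get_dummies(size, dtype=int)
print(default.sum(axis=1).tolist())

# dummy_na=True: the missing entry gets its own indicator column
with_na = pd.get_dummies(size, dummy_na=True, dtype=int)
print(with_na.sum(axis=1).tolist())

# The NaN indicator is appended as the last column
print(with_na.iloc[:, -1].tolist())
```

Without dummy_na, a missing value cannot be told apart from a row that belongs to no category, which is why the extra column can be worth keeping.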

Custom Separator for Prefix

By default, get_dummies() uses an underscore (_) as a separator between the prefix and the category name. However, you can customize this by using the prefix_sep parameter. This can be especially useful for readability or following specific naming conventions.

# Custom separator example (restore the complete 'Size' column first)
df['Size'] = ['S', 'M', 'L', 'M', 'S']
pd.get_dummies(df, columns=['Color', 'Size'], prefix=['color', 'size'], prefix_sep=':', dtype=int)

Here, the output would use colons instead of underscores, producing a more readable format for some applications:

   color:Blue  color:Green  color:Red  size:L  size:M  size:S
0           0            0          1       0       0       1
1           0            1          0       0       1       0
2           1            0          0       1       0       0
3           0            1          0       0       1       0
4           0            0          1       0       0       1
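A distinctive separator also makes the encoding easier to reverse, because each column name can be split unambiguously into variable and category. The sketch below assumes pandas 1.5 or later, where pd.from_dummies() is available, and uses the default prefixes (the original column names) rather than the custom ones above.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'M', 'S'],
})

# Encode every categorical column, joining prefix and category with ':'
encoded = pd.get_dummies(df, prefix_sep=':', dtype=int)

# from_dummies() splits each column name on the separator to rebuild the original columns
recovered = pd.from_dummies(encoded, sep=':')
print(recovered)
```

With the default underscore, a category such as 'extra_large' would make the split ambiguous, so a separator that never occurs in the data is the safer choice when a round trip is needed.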

Integrating with Machine Learning Models

Finally, it’s time to discuss how get_dummies() integrates with machine learning models. Converting categorical variables into dummy variables is often a necessary preprocessing step. Here’s a basic example of how to use this with a linear regression model in scikit-learn:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Assumes df also contains a numeric target column 'Price' (not created in the examples above)
X = pd.get_dummies(df.drop('Price', axis=1))
Y = df['Price']

# Splitting dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Training the model
model = LinearRegression().fit(X_train, Y_train)

# Predicting
predictions = model.predict(X_test)

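Because scikit-learn estimators expect numeric input, the dummy variables created by get_dummies() are what allow the linear regression model to use categorical features. For linear models, consider passing drop_first=True so that the dummy columns of each variable are not perfectly collinear.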
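One practical caveat when using get_dummies() with a model: data encoded separately from the training set may yield a different set of columns. A minimal sketch of one common workaround, assuming two small illustrative DataFrames (train and new), is to reindex the new data to the training columns.

```python
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
new = pd.DataFrame({'Color': ['Green', 'Yellow']})  # 'Yellow' was never seen in training

X_train = pd.get_dummies(train, dtype=int)
X_new = pd.get_dummies(new, dtype=int)

# Match the training layout: absent columns are filled with 0, unseen ones are dropped
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

print(X_new)
```

The unseen category ends up as a row of zeros, which the model treats as "none of the known colors". Whether that is acceptable depends on the application.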

Conclusion

Understanding and effectively utilizing the pandas.get_dummies() function is crucial for preprocessing data for machine learning. This tutorial provided a foundational understanding along with practical examples that illustrate the flexibility and utility of the function. Manipulating and preparing data correctly are key steps towards constructing reliable and accurate models.