Pandas: How to combine categorical columns into a single column

Updated: February 21, 2024 By: Guest Contributor

Introduction

Handling categorical data efficiently is crucial in data preprocessing, particularly when you’re preparing your dataset for machine learning models or data analysis. Often, you might find yourself with a dataset where combining multiple categorical columns into a single column could be advantageous. This tutorial will guide you through various ways to do this using Pandas, a powerful and popular library in Python for data manipulation and analysis.

Before we dive into the practical examples, ensure you have Pandas installed in your Python environment. You can install Pandas using pip:

pip install pandas

Getting Started with Your Dataset

First, let’s create a simple DataFrame that we will use throughout this tutorial:

import pandas as pd

data = {'Category1': ['A', 'B', 'C', 'A'],
        'Category2': ['X', 'Y', 'Z', 'X'],
        'Values': [10, 20, 30, 40]}
df = pd.DataFrame(data)
print(df)

Output:

  Category1 Category2  Values
0         A         X      10
1         B         Y      20
2         C         Z      30
3         A         X      40

Method 1: Using ‘+’ operator

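One of the simplest ways to combine two columns in Pandas is to use the ‘+’ operator. It works element-wise on columns that hold plain strings. A column stored with the category dtype must first be converted with astype(str), because categorical values do not support arithmetic operators. Here’s how you can do it: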

df['Combined'] = df['Category1'] + '_' + df['Category2']
print(df)

Output:

  Category1 Category2  Values Combined
0         A         X      10      A_X
1         B         Y      20      B_Y
2         C         Z      30      C_Z
3         A         X      40      A_X

This approach is effective for quickly merging two columns, but the expression grows unwieldy as the number of columns increases, and a missing value in either column makes the combined value missing as well. It is also a poor fit when the combination requires more complex logic.
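If the columns are stored with the category dtype rather than as plain strings, the ‘+’ operator cannot be applied to them directly. Here is a minimal sketch of one way around that, assuming a small DataFrame built only for illustration (the name df_cat is hypothetical and not part of the tutorial's dataset). It converts each column with astype(str), concatenates, and then casts the result back to category:

```python
import pandas as pd

# Hypothetical example data, stored with the category dtype
df_cat = pd.DataFrame({
    "Category1": pd.Categorical(["A", "B", "C", "A"]),
    "Category2": pd.Categorical(["X", "Y", "Z", "X"]),
})

# Convert to strings first, then concatenate element-wise
combined = df_cat["Category1"].astype(str) + "_" + df_cat["Category2"].astype(str)

# Optionally store the result as a categorical column again
df_cat["Combined"] = combined.astype("category")

print(df_cat["Combined"].tolist())
print(df_cat["Combined"].cat.categories.tolist())
```

Casting back to category is optional. It can be worthwhile when the combined column has few distinct values relative to the number of rows.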

Method 2: Using apply() Function

This method offers more flexibility. You can define a function that specifies how you want to combine the columns, and then call apply() with axis=1 so that the function is applied to each row. Here’s an example:

def combine_categories(cat1, cat2):
    return f'{cat1}_{cat2}'

df['Combined'] = df.apply(lambda row: combine_categories(row['Category1'], row['Category2']), axis=1)
print(df)

Output:

  Category1 Category2  Values Combined
0         A         X      10      A_X
1         B         Y      20      B_Y
2         C         Z      30      C_Z
3         A         X      40      A_X

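The apply() method is more versatile, allowing you to incorporate detailed logic (conditional rules, custom handling of missing values) within the function that combines the categories. Because it calls a Python function once per row, it is typically slower than the vectorized alternatives on large DataFrames.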
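As an illustration of the kind of logic that apply() makes possible, the sketch below joins only the labels that are present in each row and falls back to a placeholder when every label is missing. The DataFrame df_missing, the function combine_present, and the "Unknown" placeholder are all assumptions made for this example:

```python
import pandas as pd

# Hypothetical data with some missing labels
df_missing = pd.DataFrame({
    "Category1": ["A", "B", None, None],
    "Category2": ["X", None, "Z", None],
})

def combine_present(row, sep="_", placeholder="Unknown"):
    """Join the non-missing labels of a row; fall back to a placeholder."""
    parts = [str(value) for value in row if pd.notna(value)]
    return sep.join(parts) if parts else placeholder

# axis=1 passes each row (as a Series) to the function
df_missing["Combined"] = df_missing[["Category1", "Category2"]].apply(
    combine_present, axis=1
)
print(df_missing["Combined"].tolist())
```

Selecting the category columns before calling apply() keeps unrelated columns, such as numeric values, out of the combined label.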

Method 3: Using the str.cat() Method of Pandas Series

The str.cat() method, available through the .str accessor of a Series, concatenates Pandas Series objects that contain string data. It provides a more elegant and Pandas-native way to concatenate string columns:

df['Combined'] = df['Category1'].str.cat(df['Category2'], sep='_')
print(df)

Output:

  Category1 Category2  Values Combined
0         A         X      10      A_X
1         B         Y      20      B_Y
2         C         Z      30      C_Z
3         A         X      40      A_X

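This method is concise and vectorized, making it well suited to straightforward concatenation tasks. It also accepts a list of Series, so more than two columns can be joined in a single call, and its na_rep parameter controls how missing values are rendered.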
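The following sketch shows str.cat() joining three columns at once and contrasts its behaviour with and without na_rep. The DataFrame df_three and its Category3 column are hypothetical and exist only for this example:

```python
import pandas as pd

# Hypothetical data: three label columns, two of them with gaps
df_three = pd.DataFrame({
    "Category1": ["A", "B", None],
    "Category2": ["X", None, "Z"],
    "Category3": ["P", "Q", "R"],
})

others = [df_three["Category2"], df_three["Category3"]]

# Without na_rep, any missing input makes the whole result missing
strict = df_three["Category1"].str.cat(others, sep="_")

# With na_rep, missing inputs are replaced by the given text
filled = df_three["Category1"].str.cat(others, sep="_", na_rep="NA")

print(strict.isna().tolist())
print(filled.tolist())
```

Which behaviour is preferable depends on the task: propagating missing values keeps incomplete rows easy to detect, while na_rep yields a usable label for every row.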

Advanced Technique: Using factorize() and join()

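For more complex scenarios, where you might want to combine columns and then map each unique combination to a numerical identifier (for instance, when preparing data for machine learning algorithms), you can use factorize(). The example below also calls str.get_dummies("_"), which splits each combined string on the separator and builds one indicator column per individual label, and then attaches those columns to the DataFrame with join(). Here is a complete example: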

import pandas as pd

data = {
    "Category1": ["A", "B", "C", "A"],
    "Category2": ["X", "Y", "Z", "X"],
    "Values": [10, 20, 30, 40],
}
df = pd.DataFrame(data)

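# Build the combined label, then assign one integer code per unique label
df["Combined"] = df["Category1"] + "_" + df["Category2"]
df["ID"] = pd.factorize(df["Combined"])[0]
print(df)

# Split each label on "_" and add one 0/1 indicator column per part
joined_df = df.join(df["Combined"].str.get_dummies("_"))
print(joined_df)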

Output:

  Category1 Category2  Values Combined  ID
0         A         X      10      A_X   0
1         B         Y      20      B_Y   1
2         C         Z      30      C_Z   2
3         A         X      40      A_X   0
  Category1 Category2  Values Combined  ID  A  B  C  X  Y  Z
0         A         X      10      A_X   0  1  0  0  1  0  0
1         B         Y      20      B_Y   1  0  1  0  0  1  0
2         C         Z      30      C_Z   2  0  0  1  0  0  1
3         A         X      40      A_X   0  1  0  0  1  0  0

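This advanced technique not only combines the categories but also encodes them: the ID column holds one integer per unique combination, while the indicator columns describe the individual labels (A, B, C, X, Y, Z) rather than the combinations. Both encodings can be useful in data preprocessing, depending on what the downstream model expects.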
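Two related details may be worth seeing in isolation. The sketch below, which uses hypothetical variable names (df_enc, codes, uniques, combo_dummies), shows that factorize() returns both the integer codes and the unique labels they refer to, and that pd.get_dummies() applied to the combined column produces one indicator column per combination instead of one per individual label:

```python
import pandas as pd

df_enc = pd.DataFrame({
    "Category1": ["A", "B", "C", "A"],
    "Category2": ["X", "Y", "Z", "X"],
})
df_enc["Combined"] = df_enc["Category1"] + "_" + df_enc["Category2"]

# factorize returns the integer codes and the unique labels they refer to
codes, uniques = pd.factorize(df_enc["Combined"])
print(codes.tolist())
print(list(uniques))

# Looking up the uniques by code recovers the original labels
restored = uniques.take(codes)
print(list(restored))

# One indicator column per unique combination (not per individual label);
# astype(int) gives 0/1 values regardless of the default dummy dtype
combo_dummies = pd.get_dummies(df_enc["Combined"]).astype(int)
print(combo_dummies)
```

Keeping the uniques around makes the integer IDs reversible, which helps when model output has to be translated back into readable labels.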

Conclusion

Throughout this guide, we’ve explored various methods to combine categorical columns into a single column using Pandas. From basic concatenation with the ‘+’ operator, through apply() and str.cat(), to encoding the combined labels with factorize() and indicator columns, these methods cover a wide range of data manipulation needs. Pandas continues to stand out as a versatile tool for data scientists, offering numerous built-in functions to simplify data processing tasks.