Pandas DataFrame: Split a column into multiple columns (based on a delimiter like comma or hyphen)

Updated: February 21, 2024 By: Guest Contributor Post a comment

Overview

Pandas is an incredibly powerful tool for data analysis and manipulation. It offers a wide array of functionalities right out of the box, including the ability to reshape your data. One common task when dealing with datasets is splitting a single column into multiple columns based on a delimiter, such as a comma or a hyphen. In this tutorial, we will explore how to achieve that using various methods with Python’s Pandas library.

Prerequisites

Before diving into the code examples, ensure you have Pandas installed in your environment:

pip install pandas

For this tutorial, it’s assumed that you have basic knowledge of Python and pandas.

Basic Splitting

Let’s start with a basic scenario. Suppose we have a DataFrame with a column ‘Name’ containing full names in the format ‘First Name,Last Name’. We want to split this into two columns: ‘First Name’ and ‘Last Name’.

import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Doe', 'Alice Smith']})

df['First Name'], df['Last Name'] = zip(*df['Name'].str.split(',').to_list())
print(df)

This piece of code uses str.split(), which returns a list of strings after breaking the given string by the specified delimiter. We then use zip() to unpack the lists into two new columns.

Splitting with Expand

The expand parameter of str.split() can simplify the process by directly returning a DataFrame.

import pandas as pd

df = pd.DataFrame({'Name': ['John Doe-Jane Doe', 'Alice Smith-Bob Brown']})

df[['First Name', 'Last Name']] = df['Name'].str.split('-', expand=True)
print(df)

Using expand=True will immediately split the column into a DataFrame, allowing for cleaner and more readable code.

Handling More Complex Scenarios

In real-world data, delimiters and patterns can be more complex. Let’s consider a column with values that need to be split into multiple new columns of unequal parts.

import pandas as pd

df = pd.DataFrame({'Info': ['Name: John Doe, Age: 30, Occupation: Engineer', 'Name: Jane Doe, Age: 28, Occupation: Designer']})

new_columns = df['Info'].str.split(',', expand=True)
df = pd.concat([df.drop('Info', axis=1), new_columns], axis=1)
print(df)

This technique can handle complex data by first splitting into a temporary DataFrame, then concatenating it back with the original DataFrame.

Working with Regular Expressions

For even more flexibility, you can use regular expressions with the str.split() method to define custom delimiters.

import pandas as pd

df = pd.DataFrame({'Details': ['John+Doe|30|Engineer', 'Jane+Doe|28|Designer']})

df[['Name', 'Age', 'Occupation']] = df['Details'].str.split('\|', expand=True)
print(df)

In this example, the \|is used as the delimiter to split the string. Note that we escape the pipe character | with a backslash because it has a special meaning in regular expressions.

Advanced: Custom Function with Apply

For very complex splitting logic that may not be directly achievable with str.split(), you can define a custom function and then apply it to the DataFrame.

import pandas as pd

def split_custom(column):
    # Custom split logic here
    return pd.Series(['Split part 1', 'Split part 2'])

df = pd.DataFrame({'Complex Column': ['Value 1', 'Value 2']})
df[['Part 1', 'Part 2']] = df['Complex Column'].apply(split_custom)
print(df)

This method allows for maximum flexibility, although it may not be as performance-efficient as the built-in methods.

Conclusion

Splitting a column into multiple columns based on a delimiter is a common data manipulation task that Pandas handles gracefully. This tutorial covered from basic to advanced scenarios to give you a comprehensive understanding of the different ways you can achieve it. With these techniques in your toolkit, you’re well-equipped to handle a variety of data preprocessing scenarios.