Sling Academy
Home/Pandas/Pandas DataFrame: Split a column into multiple columns (based on a delimiter like comma or hyphen)

Pandas DataFrame: Split a column into multiple columns (based on a delimiter like comma or hyphen)

Last updated: February 21, 2024

Overview

Pandas is an incredibly powerful tool for data analysis and manipulation. It offers a wide array of functionalities right out of the box, including the ability to reshape your data. One common task when dealing with datasets is splitting a single column into multiple columns based on a delimiter, such as a comma or a hyphen. In this tutorial, we will explore how to achieve that using various methods with Python’s Pandas library.

Prerequisites

Before diving into the code examples, ensure you have Pandas installed in your environment:

pip install pandas

For this tutorial, it’s assumed that you have basic knowledge of Python and pandas.

Basic Splitting

Let’s start with a basic scenario. Suppose we have a DataFrame with a column ‘Name’ containing full names in the format ‘First Name,Last Name’. We want to split this into two columns: ‘First Name’ and ‘Last Name’.

import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Doe', 'Alice Smith']})

df['First Name'], df['Last Name'] = zip(*df['Name'].str.split(',').to_list())
print(df)

This piece of code uses str.split(), which returns a list of strings after breaking the given string by the specified delimiter. We then use zip() to unpack the lists into two new columns.

Splitting with Expand

The expand parameter of str.split() can simplify the process by directly returning a DataFrame.

import pandas as pd

df = pd.DataFrame({'Name': ['John Doe-Jane Doe', 'Alice Smith-Bob Brown']})

df[['First Name', 'Last Name']] = df['Name'].str.split('-', expand=True)
print(df)

Using expand=True will immediately split the column into a DataFrame, allowing for cleaner and more readable code.

Handling More Complex Scenarios

In real-world data, delimiters and patterns can be more complex. Let’s consider a column with values that need to be split into multiple new columns of unequal parts.

import pandas as pd

df = pd.DataFrame({'Info': ['Name: John Doe, Age: 30, Occupation: Engineer', 'Name: Jane Doe, Age: 28, Occupation: Designer']})

new_columns = df['Info'].str.split(',', expand=True)
df = pd.concat([df.drop('Info', axis=1), new_columns], axis=1)
print(df)

This technique can handle complex data by first splitting into a temporary DataFrame, then concatenating it back with the original DataFrame.

Working with Regular Expressions

For even more flexibility, you can use regular expressions with the str.split() method to define custom delimiters.

import pandas as pd

df = pd.DataFrame({'Details': ['John+Doe|30|Engineer', 'Jane+Doe|28|Designer']})

df[['Name', 'Age', 'Occupation']] = df['Details'].str.split('\|', expand=True)
print(df)

In this example, the \|is used as the delimiter to split the string. Note that we escape the pipe character | with a backslash because it has a special meaning in regular expressions.

Advanced: Custom Function with Apply

For very complex splitting logic that may not be directly achievable with str.split(), you can define a custom function and then apply it to the DataFrame.

import pandas as pd

def split_custom(column):
    # Custom split logic here
    return pd.Series(['Split part 1', 'Split part 2'])

df = pd.DataFrame({'Complex Column': ['Value 1', 'Value 2']})
df[['Part 1', 'Part 2']] = df['Complex Column'].apply(split_custom)
print(df)

This method allows for maximum flexibility, although it may not be as performance-efficient as the built-in methods.

Conclusion

Splitting a column into multiple columns based on a delimiter is a common data manipulation task that Pandas handles gracefully. This tutorial covered from basic to advanced scenarios to give you a comprehensive understanding of the different ways you can achieve it. With these techniques in your toolkit, you’re well-equipped to handle a variety of data preprocessing scenarios.

Next Article: Pandas: How to swap 2 rows in a DataFrame (3 approaches)

Previous Article: Pandas DataFrame: How to replace negative values with zero (5 examples)

Series: DateFrames in Pandas

Pandas

You May Also Like

  • How to Use Pandas Profiling for Data Analysis (4 examples)
  • How to Handle Large Datasets with Pandas and Dask (4 examples)
  • Pandas – Using DataFrame.pivot() method (3 examples)
  • Pandas: How to ‘FULL JOIN’ 2 DataFrames (3 examples)
  • Pandas: Select columns whose names start/end with a specific string (4 examples)
  • 3 ways to turn off future warnings in Pandas
  • How to Integrate Pandas with Apache Spark
  • How to Use Pandas for Web Scraping and Saving Data (2 examples)
  • How to Clean and Preprocess Text Data with Pandas (3 examples)
  • Pandas – Using Series.replace() method (3 examples)
  • Pandas json_normalize() function: Explained with examples
  • Pandas: Reading CSV and Excel files from AWS S3 (4 examples)
  • Using pandas.Series.rank() method (4 examples)
  • Pandas: Dropping columns whose names contain a specific string (4 examples)
  • Pandas: How to print a DataFrame without index (3 ways)
  • Fixing Pandas NameError: name ‘df’ is not defined
  • Pandas – Using DataFrame idxmax() and idxmin() methods (4 examples)
  • Pandas FutureWarning: ‘M’ is deprecated and will be removed in a future version, please use ‘ME’ instead
  • Pandas: Checking equality of 2 DataFrames (element-wise)