Pandas: Replace each occurrence of regex pattern in Series

Updated: February 19, 2024 By: Guest Contributor

Overview

The Python Data Analysis Library, or Pandas, is a widely used tool for data manipulation and analysis. One of its core structures is the Series, a one-dimensional labeled array capable of holding any data type. In this tutorial, you’ll learn how to replace every match of a regular expression (regex) in a Series of strings with Series.str.replace(). The examples increase in complexity, moving from single-character substitutions to capture groups and callable replacements.

Getting Started

Before diving into our examples, ensure you have Pandas installed in your Python environment. If it’s not installed, you can easily do so via pip:

pip install pandas

Now, import Pandas in your script:

import pandas as pd

Basic Usage

At its simplest, a regex replacement on a Pandas Series is a quick way to clean or modify text data. Here’s how you can do it:

data = pd.Series(['New York', 'Paris', 'Berlin', 'London'])
s = data.str.replace('o', '0', regex=True)
print(s)

The code above would output:

0    New Y0rk
1       Paris
2      Berlin
3      L0nd0n
dtype: object

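This basic example replaces every occurrence of the letter ‘o’ with ‘0’. Passing regex=True tells Pandas to treat the first argument of str.replace() as a regex pattern rather than a literal string. Since pandas 2.0 the default is regex=False, so set it explicitly whenever the pattern is meant to be a regex.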
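The difference between literal and regex matching only shows up once the pattern contains a regex metacharacter. The sketch below uses a made-up Series of version strings (not part of the original example) to illustrate how the same pattern behaves under each setting:

```python
import pandas as pd

# Illustrative data: the middle entry differs from '1.0' by one character.
versions = pd.Series(['1.0', '1x0', '2.0'])

# regex=False: '1.0' is a literal substring, so only an exact '1.0' matches.
literal = versions.str.replace('1.0', 'ONE', regex=False)

# regex=True: '.' matches any single character, so '1x0' matches as well.
pattern = versions.str.replace('1.0', 'ONE', regex=True)

# Escaping the dot restores its literal meaning inside a regex.
escaped = versions.str.replace(r'1\.0', 'ONE', regex=True)

print(literal.tolist())   # ['ONE', '1x0', '2.0']
print(pattern.tolist())   # ['ONE', 'ONE', '2.0']
print(escaped.tolist())   # ['ONE', '1x0', '2.0']
```

If the search text comes from user input, re.escape() from the standard library is one way to neutralise metacharacters before passing it with regex=True.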

Using Patterns

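Let’s take this a step further and use a regex pattern to replace any run of digits with a single ‘#’. The pattern \d+ matches one or more consecutive digits, so ‘22’ and ‘333’ each collapse to one ‘#’. This is particularly useful when masking IDs or counts during data cleaning: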

data = pd.Series(['User 1', 'User 22', 'User 333'])
s = data.str.replace(r'\d+', '#', regex=True)
print(s)

The output will be:

0    User #
1    User #
2    User #
dtype: object
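By default every match in each string is replaced. Series.str.replace() also takes an n parameter to cap the number of replacements and a case parameter for case-insensitive matching. The following is a minimal sketch on invented order strings, not data from the original tutorial:

```python
import pandas as pd

# Illustrative data with two numbers per string and mixed capitalisation.
orders = pd.Series(['Order 12, item 345', 'order 7, item 89'])

# n=1 replaces only the first match in each string.
first_only = orders.str.replace(r'\d+', '#', n=1, regex=True)

# case=False makes the match case-insensitive.
relabelled = orders.str.replace(r'order', 'Ticket', case=False, regex=True)

print(first_only.tolist())  # ['Order #, item 345', 'order #, item 89']
print(relabelled.tolist())  # ['Ticket 12, item 345', 'Ticket 7, item 89']
```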

Group Replacement

Regular expressions support capture groups, and the replacement string can refer back to them as \1, \2, and so on. This is very useful in combination with Pandas’ replace function. Imagine wanting to swap the position of first and last names in a Series:

data = pd.Series(['John Doe', 'Jane Roe'])
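# The replacement must be a raw string (r'...'); otherwise Python reads \2 and \1 as control characters
s = data.str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)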
print(s)

And the output:

0    Doe, John
1    Roe, Jane
dtype: object
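A common pitfall with backreferences is leaving the r prefix off the replacement string. The sketch below shows what Python does to an unprefixed '\2, \1', and then one possible alternative using named groups with a callable. The group names first and last are illustrative choices, not something the original example defines:

```python
import pandas as pd

names = pd.Series(['John Doe', 'Jane Roe'])

# Without the r prefix, Python converts '\2' and '\1' into control characters
# before Pandas ever sees the replacement string.
print('\2, \1' == '\x02, \x01')  # True

# With a raw string, the backslashes reach the regex engine as backreferences.
swapped = names.str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)
print(swapped.tolist())  # ['Doe, John', 'Roe, Jane']

# Named groups plus a callable keep longer patterns readable.
named = names.str.replace(
    r'(?P<first>\w+) (?P<last>\w+)',
    lambda m: f"{m.group('last')}, {m.group('first')}",
    regex=True,
)
print(named.tolist())  # ['Doe, John', 'Roe, Jane']
```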

Conditional Replacement

Sometimes, you may want to replace text only when it meets a condition. For this, str.replace() accepts a Python function as the replacement: the function is called with each regex match object and returns the text to substitute. For example, replacing only those numbers greater than 4 with ‘X’:

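import re

def replace_func(match: re.Match) -> str:
    # Convert the matched digits to an integer before comparing
    value = int(match.group(0))
    return 'X' if value > 4 else match.group(0)

data = pd.Series(['1 wings', '2 wings', '5 wings', '6 wings'])
# \d+ passes each whole number to replace_func, so multi-digit values are compared as one number
s = data.str.replace(r'\d+', replace_func, regex=True)
print(s)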

Note: ‘replace_func’ is a Python function that takes a match object and returns a string. Pandas calls it once per match, so the result here is ‘1 wings’, ‘2 wings’, ‘X wings’ and ‘X wings’.
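One detail worth checking in a conditional replacement is how much text each match covers, because the function only sees what the pattern matched. The sketch below uses a hypothetical helper, mask_large, and made-up counts to compare a per-digit pattern with a per-number pattern:

```python
import re
import pandas as pd

def mask_large(match: re.Match) -> str:
    """Return 'X' for numbers above 4, otherwise the matched text unchanged."""
    return 'X' if int(match.group(0)) > 4 else match.group(0)

# Illustrative data that includes multi-digit numbers.
counts = pd.Series(['3 wings', '12 wings', '40 wings'])

# r'\d' calls the function once per digit: 1, 2, 4 and 0 all pass through.
per_digit = counts.str.replace(r'\d', mask_large, regex=True)

# r'\d+' hands the function the whole number, so 12 and 40 are masked.
per_number = counts.str.replace(r'\d+', mask_large, regex=True)

print(per_digit.tolist())   # ['3 wings', '12 wings', '40 wings']
print(per_number.tolist())  # ['3 wings', 'X wings', 'X wings']
```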

Applying to a DataFrame

Although our focus is on Series, it’s worth mentioning that you can apply the same concepts to DataFrames by selecting the column(s) you want to modify:

df = pd.DataFrame({'names': ['John Doe', 'Jane Roe'], 'city': ['New York', 'Paris']})
# Raw strings keep \2 and \1 intact as backreferences to the captured groups
df['names'] = df['names'].str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)
print(df)
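When several columns need the same cleanup, one option is to apply the Series method to each selected column in turn. This is a minimal sketch with invented data containing stray whitespace; the column list text_columns is an assumption for illustration:

```python
import pandas as pd

# Illustrative data with repeated spaces inside the values.
df = pd.DataFrame({
    'names': ['John   Doe', 'Jane Roe'],
    'city': ['New  York', 'Paris'],
})

# Collapse runs of whitespace to a single space in every listed text column.
text_columns = ['names', 'city']
df[text_columns] = df[text_columns].apply(
    lambda col: col.str.replace(r'\s+', ' ', regex=True)
)

print(df['names'].tolist())  # ['John Doe', 'Jane Roe']
print(df['city'].tolist())   # ['New York', 'Paris']
```

DataFrame.apply() passes each column to the lambda as a Series, so the call inside is the same str.replace() used throughout this tutorial.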

Conclusion

Regular expressions make Pandas Series a flexible tool for text cleaning and manipulation. The examples in this tutorial progressed from simple character replacements to pattern-based, group-based, and conditional replacements, and then applied the same technique to a DataFrame column. As you become more comfortable with regex and Pandas, you’ll find this skillset invaluable for preparing and analyzing data.