Using the pandas.Series.str.match() method with regex (5 examples)

Updated: February 19, 2024 By: Guest Contributor

Overview

pandas is a versatile tool for data manipulation and analysis in Python. One of its most useful features is the str accessor, which provides vectorized string operations for Series and Index objects. This tutorial focuses on the str.match() method, which tests whether each string matches a regular expression (regex) starting from its first character and returns a boolean result for every element. Here, we present five practical examples, ranging from basic matching to filtering a DataFrame.

Prerequisites

Before diving into the examples, ensure you have pandas installed in your Python environment. You can install it via pip:

pip install pandas

Also, familiarity with regular expressions in Python will be beneficial.

Example 1: Basic Matching

This example demonstrates the simplest use of str.match(), checking if elements in a Series match a pattern.

import pandas as pd

# Sample data
s = pd.Series(['apple', 'banana', 'cherry', 'date'])

# Check for match: starts with 'a'
matches = s.str.match('^a')
print(matches)

The output should be:

0     True
1    False
2    False
3    False
dtype: bool

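This example checks whether each fruit name starts with the letter ‘a’. In regex, ^ denotes the start of a string. Because str.match() already anchors the pattern at the start of each string (it behaves like re.match()), the ^ is optional here, but it makes the intent explicit.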
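To see what start-anchored matching means in practice, the following minimal sketch compares str.match() with two related accessor methods, str.contains() and str.fullmatch(), on the same sample data. It assumes pandas 1.1 or newer, where str.fullmatch() is available.

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry', 'date'])

# match() only looks at the start of each string, so 'a' behaves like '^a'
starts_with_a = s.str.match('a')

# contains() searches for the pattern anywhere in the string
contains_a = s.str.contains('a')

# fullmatch() requires the entire string to match the pattern
is_exactly_date = s.str.fullmatch('date')

print(starts_with_a.tolist())    # [True, False, False, False]
print(contains_a.tolist())       # [True, True, False, True]
print(is_exactly_date.tolist())  # [False, False, False, True]
```

If the goal is "does this text appear anywhere", str.contains() is usually the better fit; str.match() is the right choice when the position at the start matters.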

Example 2: Case Insensitive Matching

To perform case insensitive searches, you can use the flags parameter with the re.IGNORECASE flag from the re module.

import pandas as pd
import re

s = pd.Series(['Apple', 'banana', 'Cherry', 'Date'])

# Case insensitive match: starts with 'a'
matches = s.str.match('^a', flags=re.IGNORECASE)
print(matches)

The output:

0     True
1    False
2    False
3    False
dtype: bool

With re.IGNORECASE, the lowercase pattern matches ‘Apple’ even though it starts with an uppercase ‘A’. Without the flag, none of the four values would match.
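str.match() also accepts a case parameter, so importing re is not strictly necessary for this task. The sketch below checks, on the same sample data, that case=False gives the same result as flags=re.IGNORECASE.

```python
import re
import pandas as pd

s = pd.Series(['Apple', 'banana', 'Cherry', 'Date'])

# Two ways to request case-insensitive matching
with_flag = s.str.match('a', flags=re.IGNORECASE)
with_case = s.str.match('a', case=False)

print(with_flag.tolist())  # [True, False, False, False]
print(with_case.tolist())  # [True, False, False, False]
```

The flags parameter remains useful when you need other regex flags, such as re.VERBOSE, alongside case insensitivity.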

Example 3: Extracting Specific Parts

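Sometimes we’re more interested in the part of the string that matches than in a True/False answer. str.match() only reports whether a string matches; it does not return the matched text. For that, use the related str.extract() method, which returns the contents of the capture groups in the pattern.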

import pandas as pd

s = pd.Series(['apple1', 'banana2', 'cherry3', 'date4'])

# Extract the first digit (a raw string keeps the backslash in \d intact)
extracted = s.str.extract(r'(\d)')
print(extracted)

The output shows the extracted digits:

   0
0  1
1  2
2  3
3  4

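In this example, we extract the first digit found in each string. The regex \d matches a single digit, and the parentheses form the capture group whose contents str.extract() returns. Note that the extracted values are strings, not integers.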
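When the numeric part can be longer than one digit, or missing altogether, a small variation helps. The sketch below uses made-up sample values (not from the example above), the pattern r'(\d+)' to capture a whole run of digits, and expand=False so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data: multi-digit suffixes and one value with no digits
s = pd.Series(['apple1', 'banana22', 'cherry', 'date404'])

# \d+ captures one or more consecutive digits; expand=False returns a Series
numbers = s.str.extract(r'(\d+)', expand=False)

# Rows without a match become missing values (NaN)
print(numbers)
```

Because the results are strings, a conversion step such as pd.to_numeric(numbers) is needed before doing arithmetic on them.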

Example 4: Advanced Patterns

For more complex pattern matching, you might want to capture multiple groups or use more intricate regex patterns. Here’s how to do that:

import pandas as pd

s = pd.Series(['Mr. Brown', 'Ms. Green', 'Dr. Smith', 'Mrs. White'])

# Match and extract titles; raw strings keep the escaped period (\.) intact
matches = s.str.match(r'^(Mr|Ms|Dr|Mrs)\.')
extracted = s.str.extract(r'^(Mr|Ms|Dr|Mrs)\.')
print("Matches:\n", matches)
print("Extracted:\n", extracted)

The output shows that every name matches one of the alternatives in the pattern, and that str.extract() returns the captured title without the trailing period:

Matches:
 0    True
1    True
2    True
3    True
dtype: bool
Extracted:
      0
0   Mr
1   Ms
2   Dr
3  Mrs

This example highlights how to match and extract prefixes from names with varying titles.
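Named capture groups make multi-group extraction easier to read, because str.extract() uses the group names as column names. The following sketch extends the title pattern with a second group for the surname; the group names title and surname are chosen here for illustration.

```python
import pandas as pd

s = pd.Series(['Mr. Brown', 'Ms. Green', 'Dr. Smith', 'Mrs. White'])

# (?P<name>...) defines a named group; each group becomes a DataFrame column
parts = s.str.extract(r'^(?P<title>Mr|Ms|Dr|Mrs)\.\s+(?P<surname>\w+)')

print(parts.columns.tolist())     # ['title', 'surname']
print(parts['title'].tolist())    # ['Mr', 'Ms', 'Dr', 'Mrs']
print(parts['surname'].tolist())  # ['Brown', 'Green', 'Smith', 'White']
```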

Example 5: Matching and Filtering

Finally, we use str.match() not just to find matches but to filter data. Because it returns a boolean Series, the result can be passed directly to boolean indexing to keep only the matching rows of a DataFrame:

import pandas as pd

# Sample dataframe
df = pd.DataFrame({'Names': ['John Doe', 'Jane Smith', 'J Doe', 'Smith'], 'Ages': [28, 34, 22, 45]})

# Keep only the rows where 'Names' starts with 'J'
filtered_df = df[df['Names'].str.match('^J')]
print(filtered_df)

The dataframe is filtered to only include names starting with ‘J’:

        Names  Ages
0    John Doe    28
1  Jane Smith    34
2       J Doe    22

This example showcases the power of str.match() in data preprocessing and filtering tasks.
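Real-world columns often contain missing values, and by default str.match() returns a missing value for those rows, which boolean indexing generally cannot use as a mask. A minimal sketch of one way to handle this, using a hypothetical variant of the data with one missing name, is to pass na=False so that missing entries count as non-matches.

```python
import pandas as pd

# Hypothetical data: the second name is missing
df = pd.DataFrame({'Names': ['John Doe', None, 'J Doe', 'Smith'],
                   'Ages': [28, 34, 22, 45]})

# na=False treats missing names as non-matches, keeping the mask purely boolean
mask = df['Names'].str.match('J', na=False)
filtered_df = df[mask]

print(mask.tolist())                  # [True, False, True, False]
print(filtered_df['Names'].tolist())  # ['John Doe', 'J Doe']
```

Whether missing values should be dropped (na=False) or kept (na=True) depends on the task, so it is worth deciding explicitly rather than relying on the default.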

Conclusion

The pandas.Series.str.match() method, coupled with regex, provides a concise, vectorized way to test whether strings begin with a given pattern. The examples above covered basic and case-insensitive matching, the related str.extract() method for pulling out matched text, and filtering a DataFrame with the resulting boolean mask. Together, these techniques cover many common text-cleaning and filtering tasks.