Pandas DataFrame: How to filter rows using regex/string pattern (5 examples)

Introduction
Getting Started
Example 1: Filtering with Simple String Contains
Example 2: Using Regex for More Flexible Filtering
Example 3: Case-Insensitive Searching
Example 4: Advanced Regex Patterns
Example 5: Using Regex to Match Entire Fields
Conclusion

Introduction

Working with data in Python often involves the use of Pandas DataFrames, a powerful and flexible data structure that allows for efficient data manipulation and analysis. One common task is filtering rows based on certain criteria. This tutorial delves into using regular expressions (regex) and string patterns to filter rows in a Pandas DataFrame. We’ll explore five progressively more complex examples, demonstrating the versatility of regex for data munging tasks.

Prerequisites

Basic understanding of Python
Installation of Pandas library: pip install pandas

Getting Started

First, let’s create a simple DataFrame to work with:

import pandas as pd

data = {'Name': ['John Doe', 'Jane Smith', 'Alice Johnson', 'Mike Brown'],
        'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
        'Age': [29, 22, 34, 45]}
df = pd.DataFrame(data)
print(df)

The DataFrame, df, looks like this:

Name            Email                   Age
John Doe        [email protected]       29
Jane Smith      [email protected]    22
Alice Johnson   [email protected]        34
Mike Brown      [email protected]        45

Example 1: Filtering with Simple String Contains

Our first example starts with the most basic form of filtering – using df["Name"].str.contains('pattern') to find rows where the ‘Name’ column contains a certain string pattern.

df[df['Name'].str.contains('Doe')]

Output:

Name            Email               Age
John Doe        [email protected]   29

This method is straightforward but is limited to case sensitivity and exact matches of substring occurrences within the target field.

Example 2: Using Regex for More Flexible Filtering

Let’s move to using regex for a more flexible approach. Regex patterns allow for the matching of specific string sequences and can accommodate a wide range of search criteria. The df["Name"].str.contains(r'regex_pattern', regex=True) method enables this.

df[df['Name'].str.contains(r'D[oaeiou]e', regex=True)]

The output showcases the adaptability of regex:

Name            Email               Age
John Doe        [email protected]   29

This example demonstrated how regex can capture variations in spelling and be more inclusive in the search criteria.

Example 3: Case-Insensitive Searching

Our third example involves making our filtering case-insensitive, which expands our search possibilities even more. This is achieved by adding the case=False parameter to the str.contains method.

df[df['Name'].str.contains('doe', case=False)]

Output:

Name            Email               Age
John Doe        [email protected]   29

This simple addition allows us to capture data regardless of how it’s written, significantly increasing our filtering capabilities.

Example 4: Advanced Regex Patterns

Moving on to more complex regex patterns, let’s explore how to use them to filter for rows based on email domain. In this example, we’ll match any email address that ends with 'email.com'.

df[df['Email'].str.contains(r'[A-Za-z0-9.-][email protected]', regex=True)]

Output:

Name             Email                    Age
John Doe         [email protected]        29
Jane Smith       [email protected]     22
Alice Johnson    [email protected]         34
Mike Brown       [email protected]         45

This example illustrates the power of advanced regex patterns in accurately identifying relevant rows for filtering based on specific criteria.

Example 5: Using Regex to Match Entire Fields

In our final example, we’ll look at how to use regex to match entire fields, a more precise and restricted form of filtering. The syntax df["Email"].str.fullmatch(r'pattern') is used for this purpose.

df[df['Email'].str.fullmatch(r'[email protected]')]

Output:

Name            Email               Age
John Doe        [email protected]   29

This technique is valuable when you want to filter rows based on an exact match of the entire content of a field, providing a high degree of specificity in the results.

Conclusion

Through these five examples, we’ve showcased the flexibility and power of using regex and string patterns for filtering rows in Pandas DataFrames. Each method offers different advantages and can be tailored to meet specific data analysis needs. As you become more familiar with these techniques, you’ll find them indispensable tools for data wrangling and preparation, enabling refined and precise data selection for further analysis.

Next Article: Pandas: How to drop all columns that contain non-numerical values

Previous Article: Pandas DataFrame: Mapping True/False to 1/0

Series: DateFrames in Pandas

Pandas