Pandas: How to filter a DataFrame using ‘LIKE’ and ‘NOT LIKE’ like in SQL

Updated: February 20, 2024 By: Guest Contributor

Introduction

In data analysis, filtering data is a foundational task that allows us to focus on specific information within a large dataset. As we navigate through data with Pandas, a key library in Python, understanding various filtering techniques becomes crucial. If you’re familiar with SQL, you might have used the ‘LIKE’ and ‘NOT LIKE’ operators for pattern matching. In this tutorial, we’ll explore how to implement similar functionality in Pandas when working with DataFrames.

Getting Started

First, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas

We’ll also use NumPy to generate sample data, so make sure it’s installed as well (Python’s built-in ‘random’ module needs no installation):

pip install numpy

Let’s begin by creating a sample DataFrame to work with:

import pandas as pd
import numpy as np

# Seed NumPy's generator so np.random.randint() is reproducible
# (seeding Python's random module would not affect NumPy)
np.random.seed(0)

# Sample DataFrame
data = {'Name': ['John Doe', 'Jane Doe', 'Alice Brown', 'Bob White', 'Charlie Black', 'Dana Green'],
        'Age': np.random.randint(20, 50, size=6),
        'City': ['New York', 'Los Angeles', 'New York', 'Boston', 'New York', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)

With our DataFrame ready, let’s dive into how to filter it using ‘LIKE’ and ‘NOT LIKE’ functions.

Using str.contains()

The str.contains() method is the closest Pandas equivalent of SQL’s ‘LIKE’. It supports both plain substring matching and regular expressions (regex matching is enabled by default). Here’s a simple example where we filter names that contain ‘Doe’:

df_doe = df[df['Name'].str.contains('Doe')]
print(df_doe)

For ‘NOT LIKE’, we simply negate the condition with the ~ operator:

df_not_doe = df[~df['Name'].str.contains('Doe')]
print(df_not_doe)
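SQL’s ‘LIKE’ wildcards map naturally onto Pandas string methods. A minimal sketch of the common translations, using a small DataFrame built for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Doe', 'Alice Brown',
                            'Bob White', 'Charlie Black', 'Dana Green']})

# SQL: WHERE Name LIKE '%Doe%'  ->  substring anywhere
anywhere = df[df['Name'].str.contains('Doe', regex=False)]

# SQL: WHERE Name LIKE 'J%'     ->  prefix match
prefix = df[df['Name'].str.startswith('J')]

# SQL: WHERE Name LIKE '%Doe'   ->  suffix match
suffix = df[df['Name'].str.endswith('Doe')]

print(anywhere['Name'].tolist())  # ['John Doe', 'Jane Doe']
print(prefix['Name'].tolist())    # ['John Doe', 'Jane Doe']
print(suffix['Name'].tolist())    # ['John Doe', 'Jane Doe']
```

Passing regex=False when you only need a literal substring avoids surprises if the search text contains regex metacharacters like '.' or '+'.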

Using Regular Expressions

For more flexibility, str.contains() can also accept regular expressions. This allows us to perform more complex pattern matching. To find names starting with ‘A’:

df_a_names = df[df['Name'].str.contains('^A')]
print(df_a_names)

To exclude those names, invert the condition:

df_not_a_names = df[~df['Name'].str.contains('^A')]
print(df_not_a_names)
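For patterns anchored at the start of the string, str.match() is a convenient alternative to writing ‘^’ yourself. A small sketch on a standalone Series:

```python
import pandas as pd

names = pd.Series(['John Doe', 'Jane Doe', 'Alice Brown'])

# str.match() anchors the pattern at the start of the string,
# so no explicit '^' is needed
starts_with_a = names[names.str.match('A')]
print(starts_with_a.tolist())  # ['Alice Brown']
```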

Case Sensitivity and Handling Missing Values

By default, str.contains() is case-sensitive. To make it case-insensitive, pass the case=False parameter. Additionally, you might encounter missing values in your data; a NaN in the column produces a NaN in the resulting mask, which cannot be used for boolean indexing. Pass na=False to treat missing values as False in your condition:

df_case_insensitive = df[df['Name'].str.contains('doe', case=False, na=False)]
print(df_case_insensitive)
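To see the effect of na=False in isolation, here is a small sketch on a Series that actually contains a missing value (a hypothetical example, not part of the sample DataFrame above):

```python
import pandas as pd

names = pd.Series(['John Doe', None, 'jane doe'])

# na=False maps the missing value to False in the mask; without it,
# the NaN would make the mask unusable for boolean indexing
mask = names.str.contains('doe', case=False, na=False)
print(mask.tolist())         # [True, False, True]
print(names[mask].tolist())  # ['John Doe', 'jane doe']
```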

Complex Filtering Using Logical Operators

To combine multiple conditions, you can use logical operators such as & (AND) and | (OR). Here’s how to filter names that contain ‘Doe’ or ‘Brown’ and are from ‘New York’:

df_complex = df[(df['Name'].str.contains('Doe|Brown')) & (df['City'] == 'New York')]
print(df_complex)
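If you prefer SQL-flavored syntax, the same combined condition can also be written with df.query(). Note that string accessor methods inside query() require the Python evaluation engine; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Doe', 'Alice Brown'],
                   'City': ['New York', 'Los Angeles', 'New York']})

# engine='python' lets query() evaluate .str accessor calls
result = df.query('Name.str.contains("Doe") and City == "New York"',
                  engine='python')
print(result['Name'].tolist())  # ['John Doe']
```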

Advanced: Custom Filtering Functions

For highly specific filtering requirements, you can define custom functions and use df.apply() to apply them row-wise. Suppose we want to filter by names that either contain ‘Doe’ or have an age greater than 30:

def custom_filter(row):
    return 'Doe' in row['Name'] or row['Age'] > 30

df_advanced = df[df.apply(custom_filter, axis=1)]
print(df_advanced)
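The same result as the row-wise custom_filter can usually be expressed with vectorized boolean masks, which is typically much faster on large DataFrames since df.apply(axis=1) calls a Python function per row. A sketch on illustrative sample data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Doe', 'Alice Brown', 'Bob White'],
                   'Age': [25, 45, 32, 28]})

# Vectorized equivalent of: 'Doe' in row['Name'] or row['Age'] > 30
mask = df['Name'].str.contains('Doe') | (df['Age'] > 30)
df_vectorized = df[mask]
print(df_vectorized['Name'].tolist())  # ['John Doe', 'Jane Doe', 'Alice Brown']
```

Reserve df.apply() for logic that genuinely cannot be expressed with column-wise operations.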

Conclusion

Filtering DataFrame rows in Pandas doesn’t directly employ SQL’s ‘LIKE’ and ‘NOT LIKE’ operators, but using str.contains(), possibly in combination with regular expressions, logical operators, and custom functions, offers a flexible and powerful alternative. This approach not only allows for basic pattern matching but also accommodates complex filtering requirements. Equipping yourself with these techniques ensures you can navigate and refine datasets with precision and ease.