Sling Academy
Home/Pandas/Pandas: How to filter a DataFrame using ‘LIKE’ and ‘NOT LIKE’ like in SQL

Pandas: How to filter a DataFrame using ‘LIKE’ and ‘NOT LIKE’ like in SQL

Last updated: February 20, 2024

Introduction

In data analysis, filtering data is a foundational task that allows us to focus on specific information within a large dataset. As we navigate through data with Pandas, a key library in Python, understanding various filtering techniques becomes crucial. If you’re familiar with SQL, you might have used the ‘LIKE’ and ‘NOT LIKE’ operators for pattern matching. In this tutorial, we’ll explore how to implement similar functionality in Pandas when working with DataFrames.

Getting Started

First, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas

We’ll also use the ‘random’ and ‘numpy’ libraries for generating sample data, so make sure these are installed as well:

pip install numpy

Let’s begin by creating a sample DataFrame to work with:

import pandas as pd
import numpy as np
import random

# Set a random seed for reproducibility
random.seed(0)

# Sample DataFrame
data = {'Name': ['John Doe', 'Jane Doe', 'Alice Brown', 'Bob White', 'Charlie Black', 'Dana Green'],
        'Age': np.random.randint(20, 50, size=6),
        'City': ['New York', 'Los Angeles', 'New York', 'Boston', 'New York', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)

With our DataFrame ready, let’s dive into how to filter it using ‘LIKE’ and ‘NOT LIKE’ functions.

Using str.contains()

The str.contains() method is the way to replicate the ‘LIKE’ behavior in Pandas. It allows for partial string matching and regular expressions. Here’s a simple example where we filter names that contain ‘Doe’:

df_doe = df[df['Name'].str.contains('Doe')]
print(df_doe)

For ‘NOT LIKE’, we simply add the ~ operator before the condition:

df_not_doe = df[~df['Name'].str.contains('Doe')]
print(df_not_doe)

Using Regular Expressions

For more flexibility, str.contains() can also accept regular expressions. This allows us to perform more complex pattern matching. To find names starting with ‘A’:

df_a_names = df[df['Name'].str.contains('^A')]
print(df_a_names)

To exclude those names, invert the condition:

df_not_a_names = df[~df['Name'].str.contains('^A')]
print(df_not_a_names)

Case Sensitivity and Handling Missing Values

By default, str.contains() is case sensitive. To make it case-insensitive, use the case=False parameter. Additionally, you might encounter missing values in your data. To avoid errors, use the na=False parameter to treat NaN values as false in your condition:

df_case_insensitive = df[df['Name'].str.contains('doe', case=False, na=False)]
print(df_case_insensitive)

Complex Filtering Using Logical Operators

To combine multiple conditions, you can use logical operators such as & (AND) and | (OR). Here’s how to filter names that contain ‘Doe’ or ‘Brown’ and are from ‘New York’:

df_complex = df[(df['Name'].str.contains('Doe|Brown')) & (df['City'] == 'New York')]
print(df_complex)

Advanced: Custom Filtering Functions

For highly specific filtering requirements, you can define custom functions and use df.apply() to apply them row-wise. Suppose we want to filter by names that either contain ‘Doe’ or have an age greater than 30:

def custom_filter(row):
    return 'Doe' in row['Name'] or row['Age'] > 30

df_advanced = df[df.apply(custom_filter, axis=1)]
print(df_advanced)

Conclusion

Filtering DataFrame rows in Pandas doesn’t directly employ SQL’s ‘LIKE’ and ‘NOT LIKE’ operators, but using str.contains(), possibly in combination with regular expressions, logical operators, and custom functions, offers a flexible and powerful alternative. This approach not only allows for basic pattern matching but also accommodates complex filtering requirements. Equipping yourself with these techniques ensures you can navigate and refine datasets with precision and ease.

Next Article: Pandas: How to shuffle rows in a DataFrame

Previous Article: Pandas: How to create an empty DataFrame with column names

Series: DateFrames in Pandas

Pandas

You May Also Like

  • How to Use Pandas Profiling for Data Analysis (4 examples)
  • How to Handle Large Datasets with Pandas and Dask (4 examples)
  • Pandas – Using DataFrame.pivot() method (3 examples)
  • Pandas: How to ‘FULL JOIN’ 2 DataFrames (3 examples)
  • Pandas: Select columns whose names start/end with a specific string (4 examples)
  • 3 ways to turn off future warnings in Pandas
  • How to Integrate Pandas with Apache Spark
  • How to Use Pandas for Web Scraping and Saving Data (2 examples)
  • How to Clean and Preprocess Text Data with Pandas (3 examples)
  • Pandas – Using Series.replace() method (3 examples)
  • Pandas json_normalize() function: Explained with examples
  • Pandas: Reading CSV and Excel files from AWS S3 (4 examples)
  • Using pandas.Series.rank() method (4 examples)
  • Pandas: Dropping columns whose names contain a specific string (4 examples)
  • Pandas: How to print a DataFrame without index (3 ways)
  • Fixing Pandas NameError: name ‘df’ is not defined
  • Pandas – Using DataFrame idxmax() and idxmin() methods (4 examples)
  • Pandas FutureWarning: ‘M’ is deprecated and will be removed in a future version, please use ‘ME’ instead
  • Pandas: Checking equality of 2 DataFrames (element-wise)