Pandas: How to filter a DataFrame by multiple conditions

Updated: February 20, 2024 By: Guest Contributor Post a comment

Introduction

When analyzing data with Python, Pandas is an indispensable tool. It offers a vast array of operations for manipulating and analyzing data. One common task in data analysis is filtering data based on multiple conditions. This tutorial will guide you through various methods to filter Pandas DataFrames by multiple conditions, complete with code examples ranging from basic to advanced.

Preparing a Sample DataFrame

Before we dive into the examples, we’ll need to first set up a sample DataFrame to work with. We’ll create a sample DataFrame that contains details about various individuals, including their names, ages, and cities they reside in.

import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)

This results in:

    Name  Age      City
0 John 28 New York
1 Anna 34 Paris
2 Peter 29 Berlin
3 Linda 32 London

Now, let’s start filtering this DataFrame based on multiple conditions.

Basic Filtering

To filter a DataFrame by multiple conditions, you can use the & operator for AND conditions and the | operator for OR conditions. Let’s start with a basic example where we want to find individuals who are over 30 years old and live in Paris or Berlin.

filtered_df = df[
    (df["Age"] > 30) & ((df["City"] == "Paris") | (df["City"] == "Berlin"))
]
print(filtered_df)

Output:

   Name  Age   City
1 Anna 34 Paris

Here, we used parenthesis to group conditions properly, ensuring the correct application of logical operations.

Using query() Method

The query() method offers a more readable way to filter DataFrames by multiple conditions. Here’s how you can use it for the same conditions as above:

filtered_df = df.query('(Age > 30) & (City == "Paris" | City == "Berlin")')
print(filtered_df)

Output:

   Name  Age   City
1 Anna 34 Paris

This method is especially useful for complex filtering conditions, making your code more readable.

Advanced Filtering with Custom Functions

For more complex scenarios, you can apply custom functions to filter a DataFrame. Suppose we want to find individuals whose name starts with ‘J’ and live in New York or whose age is above 30. We can achieve this by using apply() with a custom function.

def custom_filter(row):
return (row['Name'].startswith('J') & (row['City'] == 'New York')) | (row['Age'] > 30)

filtered_df = df[df.apply(custom_filter, axis=1)]
print(filtered_df)

Output:

   Name  Age      City
0 John 28 New York
1 Anna 34 Paris

This method allows for greater flexibility but can be less performant on large DataFrames.

Filtering with Regular Expressions

In some cases, you might want to filter data based on patterns in text data. Pandas’ str methods combined with regular expressions make this possible. For instance, finding individuals whose name starts with a vowel.

import re
filtered_df = df[df['Name'].str.contains('^[AEIOU]', regex=True)]
print(filtered_df)

Output:

   Name  Age    City
1 Anna 34 Paris

Here, we use a regular expression to match Names that start with an uppercase vowel.

Conclusion

Filtering DataFrames by multiple conditions is a common yet powerful operation in Pandas, enabling fine-grained analysis of data. We’ve explored various methods from simple logical operations to using custom functions and regular expressions for advanced filtering. These techniques form a crucial part of any data analyst’s or scientist’s toolkit, aiding in extracting meaningful insights from large datasets.