Introduction
When analyzing data with Python, Pandas is an indispensable tool. It offers a vast array of operations for manipulating and analyzing data. One common task in data analysis is filtering data based on multiple conditions. This tutorial will guide you through various methods to filter Pandas DataFrames by multiple conditions, complete with code examples ranging from basic to advanced.
Preparing a Sample DataFrame
Before we dive into the examples, we’ll need to first set up a sample DataFrame to work with. We’ll create a sample DataFrame that contains details about various individuals, including their names, ages, and cities they reside in.
import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)
This results in:
Name Age City
0 John 28 New York
1 Anna 34 Paris
2 Peter 29 Berlin
3 Linda 32 London
Now, let’s start filtering this DataFrame based on multiple conditions.
Basic Filtering
To filter a DataFrame by multiple conditions, you can use the &
operator for AND conditions and the |
operator for OR conditions. Let’s start with a basic example where we want to find individuals who are over 30 years old and live in Paris or Berlin.
filtered_df = df[
(df["Age"] > 30) & ((df["City"] == "Paris") | (df["City"] == "Berlin"))
]
print(filtered_df)
Output:
Name Age City
1 Anna 34 Paris
Here, we used parenthesis to group conditions properly, ensuring the correct application of logical operations.
Using query()
Method
The query()
method offers a more readable way to filter DataFrames by multiple conditions. Here’s how you can use it for the same conditions as above:
filtered_df = df.query('(Age > 30) & (City == "Paris" | City == "Berlin")')
print(filtered_df)
Output:
Name Age City
1 Anna 34 Paris
This method is especially useful for complex filtering conditions, making your code more readable.
Advanced Filtering with Custom Functions
For more complex scenarios, you can apply custom functions to filter a DataFrame. Suppose we want to find individuals whose name starts with ‘J’ and live in New York or whose age is above 30. We can achieve this by using apply()
with a custom function.
def custom_filter(row):
return (row['Name'].startswith('J') & (row['City'] == 'New York')) | (row['Age'] > 30)
filtered_df = df[df.apply(custom_filter, axis=1)]
print(filtered_df)
Output:
Name Age City
0 John 28 New York
1 Anna 34 Paris
This method allows for greater flexibility but can be less performant on large DataFrames.
Filtering with Regular Expressions
In some cases, you might want to filter data based on patterns in text data. Pandas’ str
methods combined with regular expressions make this possible. For instance, finding individuals whose name starts with a vowel.
import re
filtered_df = df[df['Name'].str.contains('^[AEIOU]', regex=True)]
print(filtered_df)
Output:
Name Age City
1 Anna 34 Paris
Here, we use a regular expression to match Names that start with an uppercase vowel.
Conclusion
Filtering DataFrames by multiple conditions is a common yet powerful operation in Pandas, enabling fine-grained analysis of data. We’ve explored various methods from simple logical operations to using custom functions and regular expressions for advanced filtering. These techniques form a crucial part of any data analyst’s or scientist’s toolkit, aiding in extracting meaningful insights from large datasets.