Pandas: Filtering a DataFrame by the ‘OR’ Condition

Updated: February 21, 2024 By: Guest Contributor Post a comment

Introduction

Filtering data is a fundamental aspect of working with pandas DataFrames. Specifically, applying ‘OR’ conditions allows for flexibility in selecting rows that meet at least one of the criteria specified. This tutorial delves into various methods to filter pandas DataFrames using ‘OR’ conditions, employing both simple and advanced techniques.

Preparing a Test DataFrame to Work with

A pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Before diving into filtering techniques, ensure pandas is installed and imported:

import pandas as pd

Create a sample DataFrame to work with:

import pandas as pd

data = {
    "Name": ["John", "Anna", "Peter", "Linda"],
    "Age": [28, 34, 29, 32],
    "City": ["New York", "Paris", "Berlin", "London"],
}
df = pd.DataFrame(data)

Basic Filtering with ‘OR’ Condition

To filter rows under a single ‘OR’ condition, the logical operator | is used. This section starts with the most straightforward case, then gradually shifts to more complex scenarios.

filtered_df = df[(df['Age'] > 30) | (df['City'] == 'New York')]
print(filtered_df)

This will select the rows where the age is above 30 or the city is ‘New York’. The output should display:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
3   Linda  32    London

Using numpy for Complex ‘OR’ Conditions

For more complicated ‘OR’ conditions that involve multiple columns and criteria, the numpy library can be utilized. First, import numpy as np.

import numpy as np

Then you can apply a condition like:

complex_filtered_df = df[np.logical_or(df['Age'] > 30, df['City'].isin(['New York', 'London']))]
print(complex_filtered_df)

The result showcases the flexibility of combining multiple OR conditions:

     Name  Age      City
0   John   28  New York
1   Anna   34     Paris
3   Linda  32    London

Applying ‘OR’ Condition Across Different Data Types

Filters can also be applied across columns of different data types by transforming them into a uniform type, if necessary. For instance, to filter rows based on either a numerical condition on the ‘Age’ column or a string condition on the ‘Name’ column:

mixed_type_filter = df[(df['Age'] > 30) | (df['Name'].str.startswith('J'))]
print(mixed_type_filter)

This displays:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
3   Linda  32    London

Advanced: Using query() for ‘OR’ Condition

The query() method offers a readable syntax for filtering DataFrames based on a query expression. This method is particularly useful when dealing with complex filters:

query_filtered_df = df.query('(Age > 30) | (City == "New York")')
print(query_filtered_df)

Which results in:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
3   Linda  32    London

Conclusion

Filtering by ‘OR’ condition in pandas can range from straightforward applications to more complex scenarios involving multiple columns and data types. By understanding and applying the methods showcased, one can efficiently filter DataFrames to retrieve desired data subsets, enhancing data analysis and manipulation capabilities.