Pandas: Working with the DataFrame.query() method (5 examples)

Updated: February 19, 2024 By: Guest Contributor Post a comment

Introduction

Pandas is an invaluable toolkit for data manipulation and analysis in Python. One of its powerful features, the query() method, allows for efficient and concise querying of DataFrame objects. This approach not only simplifies the syntax for filtering data but also often results in more readable code.

This tutorial is designed to guide you through the powerful DataFrame.query() method in Pandas through 5 practical examples. From basic to advanced usage, we’ll enhance your skills in data manipulation and filtering.

Getting Started

Before diving into the examples, make sure you have Pandas installed:

pip install pandas

Also, ensure you import Pandas in your Python script:

import pandas as pd

Example 1: Basic Query

Consider a DataFrame df containing employee data with columns ‘Name’, ‘Age’, and ‘Department’:

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Doe', 'Elizabeth', 'Mary'],
    'Age': [28, 34, 24, 29],
    'Department': ['HR', 'Finance', 'IT', 'Marketing']
})

To select employees aged over 28, use:

result = df.query('Age > 28')
print(result)

The output shows Doe and Mary, fitting the criteria:

   Name  Age Department
1   Doe   34    Finance
3  Mary   29  Marketing

Example 2: Combining Conditions

Moving onto combining conditions, let’s filter employees in the ‘IT’ department over the age of 25:

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Doe', 'Elizabeth', 'Mary'],
    'Age': [28, 34, 24, 29],
    'Department': ['HR', 'Finance', 'IT', 'Marketing']
})

result = df.query("Department == 'IT' & Age > 25")
print(result)

Output:

Empty DataFrame
Columns: [Name, Age, Department]
Index: []

The above query returns an empty DataFrame since there are no employees fitting both conditions.

Example 3: Using Variables in Queries

Queries can utilize variables from the surrounding Python scope using ‘@’. For an age threshold variable:

import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Doe", "Elizabeth", "Mary"],
        "Age": [28, 34, 24, 29],
        "Department": ["HR", "Finance", "IT", "Marketing"],
    }
)

age_threshold = 30
result = df.query("Age > @age_threshold")
print(result)

Output:

  Name  Age Department
1  Doe   34    Finance

Using a variable allows for more dynamic queries and simplifies making adjustments.

Example 4: Index-based Querying

The query() method can also work on the index of a DataFrame. If the dataframe’s index is meaningful, for instance, a timestamp, querying based on it might be required. To set an example index:

df.set_index('Name', inplace=True) 

To query employees named ‘John’ or ‘Doe’:

df.query('index == "John" | index == "Doe"')

Full example:

import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Doe", "Elizabeth", "Mary"],
        "Age": [28, 34, 24, 29],
        "Department": ["HR", "Finance", "IT", "Marketing"],
    }
)

df.set_index('Name', inplace=True) 
result = df.query('index == "John" | index == "Doe"')
print(result)

Output:

      Age Department
Name                
John   28         HR
Doe    34    Finance

This technique is particularly useful for time series data.

Example 5: Complex Queries Involving External Data

For more complex scenarios, consider the situation where you need to filter your DataFrame based on conditions relating to another DataFrame or series. For example, suppose we have another DataFrame departments listing departments with a budget below a certain threshold:

departments = pd.DataFrame(
    {
        "Department": ["HR", "IT"], 
        "Budget": [50000, 60000]
     }
)

Tp find employees from departments with a budget lower than $55,000:

depts_under_budget = departments.loc[departments['Budget'] < budget_threshold, 'Department']

dept_list = depts_under_budget.tolist()

result = df.query("Department in @dept_list")

Full example:

import pandas as pd

# Original DataFrame
df = pd.DataFrame({
    "Name": ["John", "Doe", "Elizabeth", "Mary"],
    "Age": [28, 34, 24, 29],
    "Department": ["HR", "Finance", "IT", "Marketing"],
})

# Departments DataFrame
departments = pd.DataFrame({
    "Department": ["HR", "IT"], 
    "Budget": [50000, 60000]
})

# Budget threshold
budget_threshold = 55000

# Identifying departments under the budget threshold
depts_under_budget = departments.loc[departments['Budget'] < budget_threshold, 'Department']

# Converting the Series to a list to use in the query
dept_list = depts_under_budget.tolist()

# Using query() to filter df where 'Department' is in the list of departments under the budget threshold
result = df.query("Department in @dept_list")

print(result)

Output:

   Name  Age Department
0  John   28         HR

Code explained:

  1. Filter the departments DataFrame to find departments with budgets below the budget_threshold.
  2. Convert the filtered Department Series to a list (dept_list).
  3. Use the .query() method on df, utilizing the @ symbol to reference the dept_list variable within the query string.

Conclusion

The query() method in Pandas is a robust tool for filtering and querying data efficiently. Through the progressive complexity shown in these examples, it’s clear that query() can handle a range of scenarios from simple to complex. Its integration with outside variables and DataFrames further supplements its versatility, making it an indispensable tool in your data processing toolkit.