Introduction
Pandas is an invaluable toolkit for data manipulation and analysis in Python. One of its powerful features, the query()
method, allows for efficient and concise querying of DataFrame objects. This approach not only simplifies the syntax for filtering data but also often results in more readable code.
This tutorial is designed to guide you through the powerful DataFrame.query()
method in Pandas through 5 practical examples. From basic to advanced usage, we’ll enhance your skills in data manipulation and filtering.
Getting Started
Before diving into the examples, make sure you have Pandas installed:
pip install pandas
Also, ensure you import Pandas in your Python script:
import pandas as pd
Example 1: Basic Query
Consider a DataFrame df
containing employee data with columns ‘Name’, ‘Age’, and ‘Department’:
import pandas as pd
df = pd.DataFrame({
'Name': ['John', 'Doe', 'Elizabeth', 'Mary'],
'Age': [28, 34, 24, 29],
'Department': ['HR', 'Finance', 'IT', 'Marketing']
})
To select employees aged over 28, use:
result = df.query('Age > 28')
print(result)
The output shows Doe and Mary, fitting the criteria:
Name Age Department
1 Doe 34 Finance
3 Mary 29 Marketing
Example 2: Combining Conditions
Moving onto combining conditions, let’s filter employees in the ‘IT’ department over the age of 25:
import pandas as pd
df = pd.DataFrame({
'Name': ['John', 'Doe', 'Elizabeth', 'Mary'],
'Age': [28, 34, 24, 29],
'Department': ['HR', 'Finance', 'IT', 'Marketing']
})
result = df.query("Department == 'IT' & Age > 25")
print(result)
Output:
Empty DataFrame
Columns: [Name, Age, Department]
Index: []
The above query returns an empty DataFrame since there are no employees fitting both conditions.
Example 3: Using Variables in Queries
Queries can utilize variables from the surrounding Python scope using ‘@’. For an age threshold variable:
import pandas as pd
df = pd.DataFrame(
{
"Name": ["John", "Doe", "Elizabeth", "Mary"],
"Age": [28, 34, 24, 29],
"Department": ["HR", "Finance", "IT", "Marketing"],
}
)
age_threshold = 30
result = df.query("Age > @age_threshold")
print(result)
Output:
Name Age Department
1 Doe 34 Finance
Using a variable allows for more dynamic queries and simplifies making adjustments.
Example 4: Index-based Querying
The query()
method can also work on the index of a DataFrame. If the dataframe’s index is meaningful, for instance, a timestamp, querying based on it might be required. To set an example index:
df.set_index('Name', inplace=True)
To query employees named ‘John’ or ‘Doe’:
df.query('index == "John" | index == "Doe"')
Full example:
import pandas as pd
df = pd.DataFrame(
{
"Name": ["John", "Doe", "Elizabeth", "Mary"],
"Age": [28, 34, 24, 29],
"Department": ["HR", "Finance", "IT", "Marketing"],
}
)
df.set_index('Name', inplace=True)
result = df.query('index == "John" | index == "Doe"')
print(result)
Output:
Age Department
Name
John 28 HR
Doe 34 Finance
This technique is particularly useful for time series data.
Example 5: Complex Queries Involving External Data
For more complex scenarios, consider the situation where you need to filter your DataFrame based on conditions relating to another DataFrame or series. For example, suppose we have another DataFrame departments
listing departments with a budget below a certain threshold:
departments = pd.DataFrame(
{
"Department": ["HR", "IT"],
"Budget": [50000, 60000]
}
)
Tp find employees from departments with a budget lower than $55,000:
depts_under_budget = departments.loc[departments['Budget'] < budget_threshold, 'Department']
dept_list = depts_under_budget.tolist()
result = df.query("Department in @dept_list")
Full example:
import pandas as pd
# Original DataFrame
df = pd.DataFrame({
"Name": ["John", "Doe", "Elizabeth", "Mary"],
"Age": [28, 34, 24, 29],
"Department": ["HR", "Finance", "IT", "Marketing"],
})
# Departments DataFrame
departments = pd.DataFrame({
"Department": ["HR", "IT"],
"Budget": [50000, 60000]
})
# Budget threshold
budget_threshold = 55000
# Identifying departments under the budget threshold
depts_under_budget = departments.loc[departments['Budget'] < budget_threshold, 'Department']
# Converting the Series to a list to use in the query
dept_list = depts_under_budget.tolist()
# Using query() to filter df where 'Department' is in the list of departments under the budget threshold
result = df.query("Department in @dept_list")
print(result)
Output:
Name Age Department
0 John 28 HR
Code explained:
- Filter the
departments
DataFrame to find departments with budgets below thebudget_threshold
. - Convert the filtered
Department
Series to a list (dept_list
). - Use the
.query()
method ondf
, utilizing the@
symbol to reference thedept_list
variable within the query string.
Conclusion
The query()
method in Pandas is a robust tool for filtering and querying data efficiently. Through the progressive complexity shown in these examples, it’s clear that query()
can handle a range of scenarios from simple to complex. Its integration with outside variables and DataFrames further supplements its versatility, making it an indispensable tool in your data processing toolkit.