Introduction
Pandas is a powerful library in Python used for data manipulation and analysis, which provides DataFrame as its primary data structure. In this tutorial, we will delve into how to select rows based on specific criteria from column values in a Pandas DataFrame. This skill is crucial for data analysis as it allows us to filter and analyze subsets of data efficiently. We will start with basic examples and progressively move to more advanced techniques.
Getting Started
Before starting, make sure you have Pandas installed in your environment:
pip install pandas
Import Pandas and create a sample DataFrame to work with:
import pandas as pd
# Sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 40],
'Occupation': ['Engineer', 'Doctor', 'Artist', 'Engineer']}
df = pd.DataFrame(data)
print(df)
This will output:
Name Age Occupation
0 John 28 Engineer
1 Anna 34 Doctor
2 Peter 29 Artist
3 Linda 40 Engineer
Basic Selection
One of the most basic forms of selection is using the ==
operator to filter rows based on a single column’s value. For instance, to select all engineers:
engineers = df[df['Occupation'] == 'Engineer']
print(engineers)
This will display:
Name Age Occupation
0 John 28 Engineer
3 Linda 40 Engineer
Using .loc[]
for Conditional Selection
Another way to select rows is using the .loc[]
method, which allows for more complex conditions. For example, selecting all engineers older than 30:
older_engineers = df.loc[(df['Occupation'] == 'Engineer') & (df['Age'] > 30)]
print(older_engineers)
This outputs:
Name Age Occupation
3 Linda 40 Engineer
Multiple Conditions and Columns
For filtering based on multiple conditions across different columns, combine the conditions using &
(AND) or |
(OR). For example, to find Artists or Doctors who are under 35:
young_professionals = df.loc[(df['Occupation'] == 'Artist') | (df['Occupation'] == 'Doctor') & (df['Age'] < 35)]
print(young_professionals)
This results in:
Name Age Occupation
1 Anna 34 Doctor
2 Peter 29 Artist
Query Method
The query()
method offers a more readable syntax for complex selections. Here’s how you can use it:
eligible_for_discount = df.query('Age < 30 and Occupation != "Engineer"')
print(eligible_for_discount)
Which outputs:
Name Age Occupation
2 Peter 29 Artist
Selecting Based on String Patterns
To filter rows based on string patterns, use methods like str.contains()
. For instance, selecting names that start with ‘J’:
j_names = df[df['Name'].str.contains('^J')]
print(j_names)
This would output:
Name Age Occupation
0 John 28 Engineer
Advanced: Using Functions with apply()
For more advanced row selection, you can use the apply()
method to apply a custom function across rows. For example, selecting rows where the sum of the numerical columns is greater than 65:
def filter_row(row):
return row['Age'] + len(row['Name']) > 65
filtered_rows = df.apply(filter_row, axis=1)
print(df[filtered_rows])
This yields:
Name Age Occupation
1 Anna 34 Doctor
3 Linda 40 Engineer
Conclusion
Selecting rows based on column values is a fundamental operation in data analysis with Pandas. Outlined techniques, ranging from basic filtering to advanced custom functions, facilitate complex data manipulations. Mastering these methods provides a strong foundation for performing sophisticated data analysis tasks in Python.