Pandas DataFrame: How to select rows based on column values

Updated: February 20, 2024 By: Guest Contributor Post a comment

Introduction

Pandas is a powerful library in Python used for data manipulation and analysis, which provides DataFrame as its primary data structure. In this tutorial, we will delve into how to select rows based on specific criteria from column values in a Pandas DataFrame. This skill is crucial for data analysis as it allows us to filter and analyze subsets of data efficiently. We will start with basic examples and progressively move to more advanced techniques.

Getting Started

Before starting, make sure you have Pandas installed in your environment:

pip install pandas

Import Pandas and create a sample DataFrame to work with:

import pandas as pd

# Sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 40],
        'Occupation': ['Engineer', 'Doctor', 'Artist', 'Engineer']}
df = pd.DataFrame(data)
print(df)

This will output:

    Name  Age Occupation
0   John   28   Engineer
1   Anna   34     Doctor
2  Peter   29     Artist
3  Linda   40   Engineer

Basic Selection

One of the most basic forms of selection is using the == operator to filter rows based on a single column’s value. For instance, to select all engineers:

engineers = df[df['Occupation'] == 'Engineer']
print(engineers)

This will display:

    Name  Age Occupation
0   John   28   Engineer
3  Linda   40   Engineer

Using .loc[] for Conditional Selection

Another way to select rows is using the .loc[] method, which allows for more complex conditions. For example, selecting all engineers older than 30:

older_engineers = df.loc[(df['Occupation'] == 'Engineer') & (df['Age'] > 30)]
print(older_engineers)

This outputs:

    Name  Age Occupation
3  Linda   40   Engineer

Multiple Conditions and Columns

For filtering based on multiple conditions across different columns, combine the conditions using & (AND) or | (OR). For example, to find Artists or Doctors who are under 35:

young_professionals = df.loc[(df['Occupation'] == 'Artist') | (df['Occupation'] == 'Doctor') & (df['Age'] < 35)]
print(young_professionals)

This results in:

    Name  Age Occupation
1   Anna   34     Doctor
2  Peter   29     Artist

Query Method

The query() method offers a more readable syntax for complex selections. Here’s how you can use it:

eligible_for_discount = df.query('Age < 30 and Occupation != "Engineer"')
print(eligible_for_discount)

Which outputs:

    Name  Age Occupation
2  Peter   29     Artist

Selecting Based on String Patterns

To filter rows based on string patterns, use methods like str.contains(). For instance, selecting names that start with ‘J’:

j_names = df[df['Name'].str.contains('^J')]
print(j_names)

This would output:

   Name  Age Occupation
0  John   28   Engineer

Advanced: Using Functions with apply()

For more advanced row selection, you can use the apply() method to apply a custom function across rows. For example, selecting rows where the sum of the numerical columns is greater than 65:

def filter_row(row):
    return row['Age'] + len(row['Name']) > 65

filtered_rows = df.apply(filter_row, axis=1)
print(df[filtered_rows])

This yields:

    Name  Age Occupation
1   Anna   34     Doctor
3  Linda   40   Engineer

Conclusion

Selecting rows based on column values is a fundamental operation in data analysis with Pandas. Outlined techniques, ranging from basic filtering to advanced custom functions, facilitate complex data manipulations. Mastering these methods provides a strong foundation for performing sophisticated data analysis tasks in Python.