Pandas: Select columns whose names start/end with a specific string (4 examples)

Updated: March 1, 2024 By: Guest Contributor

Overview

Pandas, the go-to Python library for data manipulation and analysis, offers many ways to select and reshape data, making it a valuable tool for data scientists and analysts. Selecting columns by name is a common operation, especially when dealing with large datasets that have numerous columns. Let’s explore how to select columns whose names start or end with a specific string, using four practical examples.

Example 1: Basic Selection using str.startswith() and str.endswith()

For our examples, we will work with the following DataFrame:

import pandas as pd
data = {'sales_Jan': [150, 200, 250], 'sales_Feb': [180, 230, 270], 'profit_Jan': [50, 60, 70], 'profit_Feb': [55, 65, 75]}
df = pd.DataFrame(data)

To select columns whose names start with sales, we use the str.startswith() method:

sales_columns = df.loc[:, df.columns.str.startswith('sales')]
print(sales_columns)

This will output:

   sales_Jan  sales_Feb
0        150        180
1        200        230
2        250        270

To select columns ending with Feb, use str.endswith():

feb_columns = df.loc[:, df.columns.str.endswith('Feb')]
print(feb_columns)

Output will be:

   sales_Feb  profit_Feb
0        180          55
1        230          65
2        270          75
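
As a side note, in recent pandas versions (1.4+) str.startswith() and str.endswith() also accept a tuple of strings, which is handy when a single prefix is not enough. A minimal sketch using the same df:

# Select columns starting with either 'sales' or 'profit' (tuple argument, recent pandas)
multi_prefix = df.loc[:, df.columns.str.startswith(('sales', 'profit'))]
print(multi_prefix.columns.tolist())  # ['sales_Jan', 'sales_Feb', 'profit_Jan', 'profit_Feb']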

Example 2: Using filter() with regex for Pattern Matching

The filter() method offers a more flexible way to select columns by name using regular expressions. To select columns starting with sales or profit, pass a pattern to its regex parameter:

import pandas as pd
data = {'sales_Jan': [150, 200, 250], 'sales_Feb': [180, 230, 270], 'profit_Jan': [50, 60, 70], 'profit_Feb': [55, 65, 75]}
df = pd.DataFrame(data)

matching_columns = df.filter(regex='^(sales|profit)')
print(matching_columns)

Output:

   sales_Jan  sales_Feb  profit_Jan  profit_Feb
0        150        180          50          55
1        200        230          60          65
2        250        270          70          75

This example shows the power of regex for pattern matching when you want more control over the selection criteria.
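
For instance, a trailing anchor ($) selects columns ending with a given string, and the simpler like parameter matches a plain substring without any regex. A brief sketch with the same df:

# Columns whose names end with 'Feb' (regex anchored at the end of the name)
feb_columns = df.filter(regex='Feb$')

# Columns whose names contain the substring 'sales' (no regex needed)
sales_columns = df.filter(like='sales')

print(feb_columns.columns.tolist())    # ['sales_Feb', 'profit_Feb']
print(sales_columns.columns.tolist())  # ['sales_Jan', 'sales_Feb']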

Example 3: Combining Conditions with Logical Operators

What if you want to select columns that start with sales and end with Feb? No single string method tests both conditions at once, but you can combine the two Boolean masks with the bitwise AND operator & and use the result for Boolean indexing:

import pandas as pd
data = {'sales_Jan': [150, 200, 250], 'sales_Feb': [180, 230, 270], 'profit_Jan': [50, 60, 70], 'profit_Feb': [55, 65, 75]}
df = pd.DataFrame(data)

mask = df.columns.str.startswith('sales') & df.columns.str.endswith('Feb')
combo_columns = df.loc[:, mask]
print(combo_columns)

This will output:

   sales_Feb
0        180
1        230
2        270
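
The same result can be obtained with the filter() approach from Example 2 by anchoring the pattern at both ends; a compact equivalent sketch:

# Single regex: name starts with 'sales' and ends with 'Feb'
combo_columns = df.filter(regex='^sales.*Feb$')
print(combo_columns)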

Example 4: Advanced Selection using query() and eval()

When dealing with complex selections in Pandas, especially when readability and maintainability matter, the query() and eval() methods offer elegant, expression-based alternatives to traditional indexing. They do have limitations, however: neither selects columns by name prefix or suffix directly. To work around this, you can preprocess the column names or apply those conditions outside of query(). The following steps illustrate one way to handle such a scenario.

Step 1: Preprocessing Column Names

Let’s say you have a DataFrame and want to select columns whose names start with “sales” and then perform operations on those columns.

import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'sales_Jan': [100, 200, 150],
    'sales_Feb': [120, 220, 170],
    'profit_Jan': [30, 50, 45],
    'profit_Feb': [35, 55, 48]
}
df = pd.DataFrame(data)

# Preprocess by adding a prefix to column names that meet your criteria
prefixed_df = df.rename(columns=lambda x: 'selected_' + x if x.startswith('sales') else x)

# Now, you can use query or eval with the new column names
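
To show why the rename can help, here is a small, hypothetical follow-up (the total_selected column name is purely for illustration) that references the prefixed columns by name inside eval():

# Hypothetical example: derive a new column from the renamed sales columns
prefixed_df = prefixed_df.eval('total_selected = selected_sales_Jan + selected_sales_Feb')
print(prefixed_df)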

Step 2: Using query() with Condition Outside

Although query() cannot select columns by name pattern, you can still filter rows based on conditions applied to the dynamically selected columns:

# For illustration, let's say we want to filter rows where January sales are more than 100
# First, identify the columns of interest
columns_of_interest = [col for col in df.columns if col.startswith('sales')]

# Use boolean indexing instead of query() for this part
filtered_df = df[df[columns_of_interest[0]] > 100]

print(filtered_df)
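
If the condition should hold for every matching column rather than only the first one, a short sketch using the same df and columns_of_interest with gt() and all(axis=1):

# Keep rows where every 'sales_' column exceeds 100
mask = df[columns_of_interest].gt(100).all(axis=1)
filtered_all = df[mask]
print(filtered_all)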

Step 3: Combining with query() for Other Operations

After the preprocessing or external filtering above, you can use query() for the remaining row-level selections, which it handles more straightforwardly:

# Assuming you have a 'region' column and want to filter further
df['region'] = ['East', 'West', 'East']  # Adding for the sake of example

# Re-apply the column-based filter so filtered_df includes the new column, then query by region
filtered_df = df[df[columns_of_interest[0]] > 100].query('region == "East"')

print(filtered_df)
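
Once the matching column names are known, both conditions are plain row-level comparisons, so they can also be combined into a single query() call; a compact sketch that builds the expression from the dynamically selected column name:

# One-step equivalent: both conditions inside one query() expression
one_step = df.query(f'{columns_of_interest[0]} > 100 and region == "East"')
print(one_step)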

Conclusion

Pandas offers versatile and powerful tools for selecting columns based on their name patterns. Understanding how to leverage these tools can save time and simplify your data analysis workflow. From basic string matching to advanced regex filtering, these techniques are indispensable for efficiently managing large datasets.