Pandas: Select Columns by Data Type in a DataFrame

Updated: March 10, 2023 By: Wolf Post a comment

This concise and straightforward article shows you how to select columns by data type in a DataFrame in Pandas.

What is the point?

In Pandas, we can select columns based on their data types using the select_dtypes() method. This method returns a new DataFrame containing only columns of the specified data type. We can specify the data type we want to select using the include or exclude parameters.

If you want to select only columns with numerical data types, you can use:

df.select_dtypes(include=['number'])

In case you want to select only columns with string data types, you can use:

df.select_dtypes(include=['object'])

You can also combine multiple data types using a list or a tuple. For example, if you want to retrieve columns with integer or boolean data types, use this:

df.select_dtypes(include=['int', 'bool'])

The exclude parameter is helpful when you need to specify which data types to exclude from the selection. For example, if you want to select all columns except those with datetime data types, you can use:

df.select_dtypes(exclude=['datetime'])

For a better and deeper understanding, see the real-world examples below.

Examples

Example 1

Suppose we have a DataFrame containing columns of different data types:

import pandas as pd

data = {'name': ['John', 'Jane', 'Bob'],
        'age': [25, 30, 35],
        'net_worth': [1000, 20000, 300000],
        'is_investor': [False, False, True]}

df = pd.DataFrame(data)

To select only the columns with numeric data types, we call the select_dtypes() method with the include parameter set to number like so:

numeric_cols = df.select_dtypes(include='number')
print(numeric_cols)

Output:

   age  net_worth
0   25       1000
1   30      20000
2   35     300000

To select only the non-numeric columns (object, bool, category), we can use the select_dtypes() method with the exclude parameter set to number:

non_numeric_cols = df.select_dtypes(exclude='number')
print(non_numeric_cols)

Output:

   name  is_investor
0  John        False
1  Jane        False
2   Bob         True

Example 2

Let’s say we have a dataset of customer transactions containing columns of different data types:

import pandas as pd

data = {
    'transaction_id': [101, 102, 103, 104, 105],
    'date': ['2023-01-01', '2024-01-02', '2024-01-03', '2023-01-04', '2023-01-05'],
    'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
    'product_name': ['A', 'B', 'C', 'A', 'B'],
    'price': [10.5, 20.0, 5.5, 7.0, 18.5],
    'quantity': [2, 1, 3, 4, 2],
    'is_returned': [False, False, True, False, True]
}

df = pd.DataFrame(data)

# Convert the date column to datetime type
df['date'] = pd.to_datetime(df['date'])

Suppose we are only interested in the dates of transactions and whether those transactions are from old customers or not, we will do the following:

numeric_cols = df.select_dtypes(include=['datetime64', 'bool'])
print(numeric_cols)

Output:

        date  is_returned
0 2023-01-01        False
1 2024-01-02        False
2 2024-01-03         True
3 2023-01-04        False
4 2023-01-05         True

What’s next?

You can practice more about selecting columns by data type in Pandas DataFrames with larger and more complex datasets provided by Sling Academy:

Good luck!