This concise and straightforward article shows you how to select columns by data type in a DataFrame in Pandas.
What is the point?
In Pandas, we can select columns based on their data types using the select_dtypes() method. This method returns a new DataFrame containing only columns of the specified data type. We can specify the data type we want to select using the include or exclude parameters.
If you want to select only columns with numerical data types, you can use:
df.select_dtypes(include=['number'])
In case you want to select only columns with string data types, you can use:
df.select_dtypes(include=['object'])
You can also combine multiple data types using a list or a tuple. For example, if you want to retrieve columns with integer or boolean data types, use this:
df.select_dtypes(include=['int', 'bool'])
The exclude parameter is helpful when you need to specify which data types to exclude from the selection. For example, if you want to select all columns except those with datetime data types, you can use:
df.select_dtypes(exclude=['datetime'])
For a better and deeper understanding, see the real-world examples below.
Examples
Example 1
Suppose we have a DataFrame containing columns of different data types:
import pandas as pd
data = {'name': ['John', 'Jane', 'Bob'],
'age': [25, 30, 35],
'net_worth': [1000, 20000, 300000],
'is_investor': [False, False, True]}
df = pd.DataFrame(data)
To select only the columns with numeric data types, we call the select_dtypes() method with the include parameter set to number like so:
numeric_cols = df.select_dtypes(include='number')
print(numeric_cols)
Output:
age net_worth
0 25 1000
1 30 20000
2 35 300000
To select only the non-numeric columns (object, bool, category), we can use the select_dtypes() method with the exclude parameter set to number:
non_numeric_cols = df.select_dtypes(exclude='number')
print(non_numeric_cols)
Output:
name is_investor
0 John False
1 Jane False
2 Bob True
Example 2
Let’s say we have a dataset of customer transactions containing columns of different data types:
import pandas as pd
data = {
'transaction_id': [101, 102, 103, 104, 105],
'date': ['2023-01-01', '2024-01-02', '2024-01-03', '2023-01-04', '2023-01-05'],
'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
'product_name': ['A', 'B', 'C', 'A', 'B'],
'price': [10.5, 20.0, 5.5, 7.0, 18.5],
'quantity': [2, 1, 3, 4, 2],
'is_returned': [False, False, True, False, True]
}
df = pd.DataFrame(data)
# Convert the date column to datetime type
df['date'] = pd.to_datetime(df['date'])
Suppose we are only interested in the dates of transactions and whether those transactions are from old customers or not, we will do the following:
numeric_cols = df.select_dtypes(include=['datetime64', 'bool'])
print(numeric_cols)
Output:
date is_returned
0 2023-01-01 False
1 2024-01-02 False
2 2024-01-03 True
3 2023-01-04 False
4 2023-01-05 True
What’s next?
You can practice more about selecting columns by data type in Pandas DataFrames with larger and more complex datasets provided by Sling Academy:
Good luck!