Pandas: How to drop all columns that contain non-numerical values

Updated: February 20, 2024 By: Guest Contributor

Introduction

In the realm of data analysis with Python, Pandas stands out for its powerful and flexible data manipulation capabilities. A common task in data preprocessing is the removal of columns that contain non-numerical values, as many machine learning models operate solely on numeric input. This tutorial will guide you through various methods to accomplish this task, providing code examples that range from basic to advanced.

Getting Started

Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python. Its main data structure is the DataFrame, which can be thought of as a table whose columns each have their own data type (dtype).

Before diving into the examples, ensure you have Pandas installed:

pip install pandas

Now, let’s begin by importing Pandas:

import pandas as pd

Basic Example

Consider a DataFrame that contains both numerical and non-numerical columns:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [4, 5, 6]
})
print(df)

This will output:

   A  B  C
0  1  a  4
1  2  b  5
2  3  c  6
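Before dropping anything, it can help to check how Pandas has typed each column. The following is a minimal sketch that rebuilds the same DataFrame and uses is_numeric_dtype from pandas.api.types; the numeric_flags name is only illustrative and not part of the original example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [4, 5, 6]
})

# Show the dtype pandas inferred for each column
print(df.dtypes)

# Map each column name to whether its dtype counts as numeric
# (numeric_flags is an illustrative name)
numeric_flags = {col: is_numeric_dtype(df[col]) for col in df.columns}
print(numeric_flags)
```

Here only 'B' is flagged as non-numeric, so it is the only column the next step should remove.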

To drop columns that contain non-numerical values, we use the select_dtypes() method:

# 'number' selects all numeric dtypes without needing to import NumPy
df_numeric = df.select_dtypes(include='number')
print(df_numeric)

This will result in:

   A  C
0  1  4
1  2  5
2  3  6
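select_dtypes() keeps the numeric columns rather than literally dropping the others. If you would rather name the columns being removed, one possible approach (a sketch, not the only way) is to select the non-numeric columns and pass them to drop(). The variable names non_numeric_cols and df_dropped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [4, 5, 6]
})

# Names of the columns whose dtype is not numeric
non_numeric_cols = df.select_dtypes(exclude='number').columns

# drop() returns a new DataFrame by default and leaves df unchanged
df_dropped = df.drop(columns=non_numeric_cols)
print(df_dropped.columns.tolist())
```

This form makes it easy to log or review the removed column names before discarding them.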

Intermediate Example

Now, let’s handle a more complex DataFrame that includes different types of data:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
    'Gender': ['F', 'M', 'M']
})

# Convert 'Age' to strings to simulate numbers stored as text
df['Age'] = df['Age'].astype(str)
print(df)

This will output:

      Name Age  Salary Gender
0    Alice  25   50000      F
1      Bob  30   60000      M
2  Charlie  35   70000      M

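To exclude non-numeric columns, we again use select_dtypes(), this time explicitly excluding object types. Because 'Age' now holds strings, it is removed along with 'Name' and 'Gender'. Keep in mind that excluding object only removes object columns: other non-numeric dtypes, such as datetime or category, would be kept.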

df_numeric = df.select_dtypes(exclude=['object'])
print(df_numeric)

This output will be:

   Salary
0   50000
1   60000
2   70000
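The difference between excluding object columns and including only numeric ones shows up as soon as another dtype is present. The sketch below adds a hypothetical 'Joined' datetime column (not part of the original example) to illustrate it.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000],
    # 'Joined' is a hypothetical column added for illustration
    'Joined': pd.to_datetime(['2020-01-15', '2021-06-01', '2022-09-30'])
})

# Excluding only object columns leaves the datetime column in place
kept_by_exclude = df.select_dtypes(exclude=['object']).columns.tolist()

# Including only numeric dtypes keeps just 'Salary'
kept_by_include = df.select_dtypes(include='number').columns.tolist()

print(kept_by_exclude)
print(kept_by_include)
```

When the goal is strictly numeric input, include='number' is usually the safer of the two choices.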

Advanced Example

In more complex data scenarios, you may find mixed content within columns. Pandas provides tools to detect and convert these mixed types. Let’s consider a DataFrame with hidden non-numeric values in a seemingly numeric column:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 'two', 3],
    'B': [4, 5, 6],
    'C': ['x', 7, 'y']
})

# Attempt to convert every column; values that cannot be parsed become NaN
converted = df.apply(pd.to_numeric, errors='coerce')

# Report which columns picked up NaN values during conversion
print("Columns with non-numeric values:", converted.columns[converted.isna().any()].tolist())

# Drop columns with any NaN values resulting from invalid conversions
df_numeric = converted.dropna(axis=1, how='any')
print("\nDataFrame after dropping non-numeric columns:")
print(df_numeric)

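This procedure attempts to convert every value in each column to a numeric type, marking unconvertible values as NaN. It then drops every column that contains a NaN value, which removes the columns that originally held non-numeric entries. Note that a column that already contained missing values before the conversion is dropped as well, because dropna() cannot tell the two kinds of NaN apart.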
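If legitimate missing values should survive, one option is to compare the data before and after coercion and drop only the columns where a value that was present turned into NaN. This is a minimal sketch under that assumption; the helper name drop_non_numeric_columns and the sample data are hypothetical.

```python
import pandas as pd

def drop_non_numeric_columns(df):
    """Return a numeric copy of df without columns that hold unparseable values."""
    converted = df.apply(pd.to_numeric, errors='coerce')
    # A value is invalid if it existed originally but became NaN after coercion
    invalid = converted.isna() & df.notna()
    keep = ~invalid.any(axis=0)
    return converted.loc[:, keep]

# Hypothetical sample: 'B' has a genuine missing value, 'C' stores numbers as text
sample = pd.DataFrame({
    'A': [1, 'two', 3],
    'B': [4.0, None, 6.0],
    'C': ['7', '8', '9']
})

result = drop_non_numeric_columns(sample)
print(result.columns.tolist())
```

With this sample, 'A' is removed because 'two' cannot be parsed, while 'B' keeps its missing value and 'C' is converted from text to numbers.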

Conclusion

Dropping non-numerical columns in Pandas is a common step when cleaning a dataset before analysis or model training. With select_dtypes() for columns that are already correctly typed, and conversion with pd.to_numeric() for columns with mixed content, you can prepare your data for further processing in a few lines of code.