Pandas: How to create a categorical column in a DataFrame

Last updated: February 23, 2024

Pandas, a powerful and widely used data manipulation library in Python, provides numerous functionalities for dealing with structured data. One of the key features of Pandas is its ability to handle categorical data efficiently. In this tutorial, we will explore how to create a categorical column in a DataFrame, moving from basic techniques to more advanced applications.

Understanding Categorical Data

Before diving into creating categorical columns, it’s crucial to understand what categorical data is. Categorical data represents variables that take values from a finite set of categories or distinct groups. These categories can be of two types: nominal (no natural order, such as colors) or ordinal (a natural order exists, such as sizes).

Getting Started with Pandas

First, ensure you have Pandas installed. You can install Pandas using pip:

pip install pandas

Creating a Simple Categorical Column

Let’s start with a basic example. Suppose you have a DataFrame that contains the column ‘Color’ with different colors as its values:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
print(df)

Output:

   Color
0    Red
1   Blue
2  Green
3   Blue
4    Red

To convert this column into a categorical column, call astype('category') on it. The categories are inferred from the distinct values in the column:

df['Color'] = df['Color'].astype('category')
print(df['Color'].dtype)

Output:

category
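To make the effect of the conversion more concrete, here is a minimal sketch that lists the inferred categories and compares memory usage before and after. The 30,000-row Series is made-up illustration data, not part of the example above, and the exact byte counts depend on your pandas version, so only the comparison is printed.

```python
import pandas as pd

# Hypothetical data: a long column with only three distinct values.
colors = pd.Series(['Red', 'Blue', 'Green'] * 10_000)
as_category = colors.astype('category')

# Categories inferred from the data are stored once, in sorted order.
print(as_category.cat.categories.tolist())  # ['Blue', 'Green', 'Red']

# deep=True also counts the memory used by the string values themselves.
string_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(category_bytes < string_bytes)  # True
```

The saving comes from storing each row as a small integer code instead of a full string, so it is largest when a column has few distinct values relative to its length.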

Setting Order in Categorical Data

In some cases, categorical data has a natural order. Using the pandas.Categorical class, you can specify this order. For example, suppose you have a DataFrame with a ‘Size’ column:

import pandas as pd

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
df['Size'] = pd.Categorical(df['Size'], categories=['Small', 'Medium', 'Large'], ordered=True)
print(df['Size'])
print("Order: ", df['Size'].cat.categories)

Output:

0     Small
1    Medium
2     Large
3    Medium
4     Small
Name: Size, dtype: category
Categories (3, object): ['Small' < 'Medium' < 'Large']
Order:  Index(['Small', 'Medium', 'Large'], dtype='object')
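Declaring the order is what makes comparisons and sorting meaningful for such a column. The following sketch rebuilds the same 'Size' values as a standalone Series (the variable names are only for illustration) and shows a comparison, the minimum and maximum, and a sort that follows the category order rather than alphabetical order.

```python
import pandas as pd

# Same values and category order as the 'Size' column above.
sizes = pd.Series(pd.Categorical(
    ['Small', 'Medium', 'Large', 'Medium', 'Small'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
))

# Element-wise comparison against a category uses the declared order.
larger_than_small = sizes > 'Small'
print(larger_than_small.tolist())    # [False, True, True, True, False]

# min() and max() are only defined for ordered categoricals.
print(sizes.min(), sizes.max())      # Small Large

# Sorting follows Small < Medium < Large, not alphabetical order.
print(sizes.sort_values().tolist())  # ['Small', 'Small', 'Medium', 'Medium', 'Large']
```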

Using Categorical Data for Analysis

With categorical data now defined in your DataFrame, operations such as grouping follow the category order you declared instead of alphabetical order. For example, grouping data based on a categorical column:

print(df.groupby('Size', observed=False).size())

Output:

Size
Small     2
Medium    2
Large     1
dtype: int64
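One detail worth knowing when grouping by a categorical column is the observed parameter of groupby, which controls whether categories that never occur in the data still appear in the result. The sketch below uses a small made-up DataFrame in which 'Large' is declared but never used. Both values of observed are passed explicitly because the default has differed between pandas versions.

```python
import pandas as pd

# Hypothetical data: 'Large' is a declared category with no rows.
df = pd.DataFrame({'Size': ['Small', 'Medium', 'Small']})
df['Size'] = pd.Categorical(
    df['Size'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
)

# observed=False keeps every declared category, including empty ones.
all_counts = df.groupby('Size', observed=False).size()
print(all_counts.to_dict())   # {'Small': 2, 'Medium': 1, 'Large': 0}

# observed=True keeps only categories that actually occur.
seen_counts = df.groupby('Size', observed=True).size()
print(seen_counts.to_dict())  # {'Small': 2, 'Medium': 1}
```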

Advanced Techniques

As you become more familiar with handling categorical data, you might encounter situations requiring more advanced techniques. For instance, you could use category codes to represent each category with a unique integer, which is useful for certain types of analyses and visualizations:

df['Size_code'] = df['Size'].cat.codes
print(df)

Output:

     Size  Size_code
0   Small          0
1  Medium          1
2   Large          2
3  Medium          1
4   Small          0
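Two properties of category codes are easy to miss: a missing value is encoded as -1, and the codes can be mapped back to labels with pd.Categorical.from_codes. The sketch below illustrates both on a small made-up Series that contains one missing value.

```python
import pandas as pd

# Hypothetical data with one missing value in the middle.
sizes = pd.Series(pd.Categorical(
    ['Small', None, 'Large'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
))

# Codes are positions in the category list; missing values become -1.
codes = sizes.cat.codes
print(codes.tolist())  # [0, -1, 2]

# Rebuild the labels from the codes and the original category list.
restored = pd.Categorical.from_codes(
    codes,
    categories=sizes.cat.categories,
    ordered=True,
)
print(restored[0], restored[2])  # Small Large
print(pd.isna(restored[1]))      # True
```

Because -1 is a sentinel rather than a real position, it is usually safer to filter out or handle missing values before feeding codes into a numeric analysis.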

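Another advanced technique involves creating a custom categorical type with CategoricalDtype. Such a type fixes the full set of categories and their order in one place, so it can be reused across columns and can include categories that do not yet appear in the data (unobserved categories).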
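As a rough illustration of a custom categorical type, the sketch below defines one CategoricalDtype and applies it to two columns of a made-up DataFrame. The column names and the size_type variable are assumptions chosen for this example only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A reusable ordered type; 'Large' is declared even though no row uses it.
size_type = CategoricalDtype(
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
)

# Hypothetical data: two columns that share the same set of sizes.
df = pd.DataFrame({
    'Ordered': ['Small', 'Medium', 'Small'],
    'Delivered': ['Small', 'Small', 'Medium'],
})
df = df.astype({'Ordered': size_type, 'Delivered': size_type})

# Both columns now share exactly the same dtype.
print(df['Ordered'].dtype == df['Delivered'].dtype)  # True

# Unobserved categories are still reported, with a count of zero.
counts = df['Ordered'].value_counts()
print(counts['Large'])  # 0
```

Sharing one type object across columns keeps their categories and ordering consistent, which makes the columns directly comparable.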

Conclusion

Transforming a column into a categorical column in a Pandas DataFrame allows for more efficient data processing and analysis. By leveraging the categorical data type, you can reduce memory usage and improve performance when a column holds a small number of distinct values relative to its length, and you can encode a meaningful order where one exists. Start applying these techniques to make your data analysis processes more robust and efficient.
