Explaining pandas.Series.factorize() method through examples

Last updated: February 18, 2024

Overview

In data science and analysis, categorizing and encoding features are indispensable tasks to prepare data for models that demand numeric inputs. One excellent tool for this purpose in Python is the pandas library, which provides a versatile method Series.factorize() for encoding categorical variables. This tutorial aims to explain the pandas.Series.factorize() method through a comprehensive set of examples, moving from basic to advanced applications.

Getting Started with pandas.Series.factorize()

The pandas.Series.factorize() method encodes the values of a Series as integer codes. It returns a tuple of two objects: an array of codes (one per input value) and an Index of the unique values, where each code is the position of its value in that Index. This is particularly useful in machine learning preprocessing, where you often need to convert categorical data into a suitable numeric format.

Basic Example

import pandas as pd

# Creating a pandas series
series = pd.Series(['a', 'b', 'a', 'c', 'b', 'a'])

# Applying factorize()
labels, uniques = series.factorize()

print("Labels:", labels)
print("Unique values:", uniques)

This snippet shows the basic use of Series.factorize(). It returns two objects: ‘labels’, a NumPy array of the integer codes, and ‘uniques’, a pandas Index of the unique categories. The output is:

Labels: [0 1 0 2 1 0]
Unique values: Index(['a', 'b', 'c'], dtype='object')
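Because each code is simply a position in the uniques, the original values can be recovered from the two results. The following is a minimal sketch of that round trip; the variable name `restored` is only illustrative.

```python
import pandas as pd

series = pd.Series(['a', 'b', 'a', 'c', 'b', 'a'])
codes, uniques = series.factorize()

# Each code is a position in `uniques`, so take() rebuilds the original values
restored = pd.Series(uniques.take(codes), index=series.index)

print(restored.tolist() == series.tolist())  # True
```

This round trip only holds as long as no value was encoded with the missing-value sentinel -1, since -1 does not refer to a real position in the uniques.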

Handling Missing Values

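Series.factorize() also handles missing values (None, NaN). By default, missing values are not treated as a category: they receive the sentinel code -1 and are left out of the uniques. Older pandas releases exposed this through the na_sentinel parameter, which was deprecated in pandas 1.5 and removed in 2.0; current releases use the boolean use_na_sentinel parameter (True by default) instead:

series = pd.Series(['a', 'b', None, 'c', 'b', 'a', None])

# use_na_sentinel=True is the default; missing values get the code -1
labels, uniques = series.factorize(use_na_sentinel=True)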

print("Labels:", labels)

Output:

Labels: [0 1 -1 2 1 0 -1]
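If missing values should count as a category of their own, recent pandas versions (1.5 and later) accept use_na_sentinel=False. The sketch below compares both settings; it assumes such a pandas version is installed, and the variable names are illustrative.

```python
import pandas as pd

series = pd.Series(['a', 'b', None, 'c', 'b', 'a', None])

# Default: missing values are coded as -1 and excluded from the uniques
codes_default, uniques_default = series.factorize()

# use_na_sentinel=False: the missing value becomes one of the uniques
# and receives an ordinary non-negative code
codes_keep, uniques_keep = series.factorize(use_na_sentinel=False)

print("Default codes:", codes_default)
print("Default uniques:", len(uniques_default))
print("Codes with NaN kept:", codes_keep)
print("Uniques with NaN kept:", len(uniques_keep))
```

With the second form, every row has a valid code, which can be convenient when a downstream step cannot handle negative category indices.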

Sort by Appearance

By default, Series.factorize() orders the unique values by their first appearance in the input, which preserves the order in which categories occur. To have the unique values sorted instead (alphabetically for strings), pass sort=True. The codes then follow the sorted order while still lining up with the original rows, which is not the case if you sort the Series itself before factorizing:

series = pd.Series(['a', 'b', 'a', 'c', 'b', 'd'])

# Sorting the unique values during factorize
labels, uniques = series.factorize(sort=True)

print("Sorted Labels:", labels)
print("Sorted Uniques:", uniques)
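The series above happens to introduce its values in alphabetical order, so the effect of sorting is easier to see on data whose first appearances are not alphabetical. A small sketch with made-up values:

```python
import pandas as pd

series = pd.Series(['b', 'a', 'c', 'b'])

# Default: codes follow the order of first appearance
codes_default, uniques_default = series.factorize()
print(codes_default, list(uniques_default))  # [0 1 2 0] ['b', 'a', 'c']

# sort=True: uniques are sorted and the codes are remapped to match
codes_sorted, uniques_sorted = series.factorize(sort=True)
print(codes_sorted, list(uniques_sorted))    # [1 0 2 1] ['a', 'b', 'c']
```

In both cases the codes stay aligned with the original row order; only the numbering of the categories changes.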

Advanced Usage: Custom Encodings

In more complex scenarios, you might need to map categories to custom numeric codes. While Series.factorize() doesn’t directly support custom mappings, you can achieve this by combining it with other pandas methods:

series = pd.Series(['a', 'b', 'a', 'c', 'd', 'b', 'e'])
codes, uniques = series.factorize()

# Mapping each unique value to a custom code (here, an uppercase letter)
mapping = {value: chr(65 + code) for code, value in enumerate(uniques)}
custom_labels = series.map(mapping)

print("Custom Encoded Series:")
print(custom_labels)

Output:

Custom Encoded Series: 
0    A
1    B
2    A
3    C
4    D
5    B
6    E
dtype: object
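When the goal is custom integer codes rather than letters, one possible alternative is pd.Categorical with an explicit category list, where the position of each category in the list becomes its code. This is a sketch of that approach, not something factorize() does itself; the ordering in `custom_order` is a hypothetical choice for illustration.

```python
import pandas as pd

series = pd.Series(['a', 'b', 'a', 'c', 'd', 'b', 'e'])

# Hypothetical custom ordering: the position in this list becomes the code
custom_order = ['e', 'd', 'c', 'b', 'a']

categorical = pd.Categorical(series, categories=custom_order)
custom_codes = categorical.codes

print(custom_codes)  # [4 3 4 2 1 3 0]
```

Values that are not listed in the categories are treated as missing and receive the code -1.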

Combining with Machine Learning

The factorized codes can be used as numeric features in machine learning workflows. Keep in mind that integer codes suggest an ordering among categories that may not exist; tree-based models are usually less sensitive to this than linear models. Here’s a toy example with a decision tree classifier:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Example dataset
df = pd.DataFrame({'Category': ['a', 'b', 'c', 'a', 'b', 'c', 'd'], 'Label': [1, 0, 1, 0, 1, 0, 1]})

# Encoding categoricals
labels, _ = df['Category'].factorize()
df['Category'] = labels

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(df['Category'], df['Label'], test_size=0.2, random_state=42)

# Training a model
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train.values.reshape(-1, 1), y_train)

# Predicting
predictions = decision_tree.predict(X_test.values.reshape(-1, 1))
print("Predictions:", predictions)
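In practice, new data has to be encoded with the same codes the model was trained on, and calling factorize() again on the new data would not guarantee that. One way to reuse the original encoding, sketched here with made-up values, is to look up new values in the uniques Index with get_indexer(), which returns -1 for values it has not seen.

```python
import pandas as pd

# Categories seen at training time
train = pd.Series(['a', 'b', 'c', 'a'])
train_codes, uniques = train.factorize()

# New data, including a category ('d') that was not seen during training
new_values = pd.Series(['b', 'd', 'a'])

# Look up each new value's position in the training uniques; unseen values get -1
new_codes = uniques.get_indexer(new_values)

print(new_codes)  # [ 1 -1  0]
```

How to treat the -1 rows (dropping them, or reserving a dedicated "unknown" code) is a modelling decision left to the reader.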

Conclusion

Throughout this tutorial, we’ve explored the pandas.Series.factorize() method across several scenarios: basic encoding, missing values, the ordering of the unique values, custom encodings, and preparing features for a machine learning model. It is a simple and fast way to turn categorical values into integer codes during data pre-processing and exploration.
