Pandas: How to drop unused levels in a MultiIndex

Introduction
Understanding MultiIndex
Identifying Unused Levels
How to Drop Unused Levels
Advanced Example
Conclusion

Introduction

Working with hierarchical indices, or MultiIndexes, in pandas lets you handle higher-dimensional data in a lower-dimensional form. However, after slicing or filtering your DataFrame, the MultiIndex can still carry entries in its levels that no remaining row uses. These unused levels can be confusing when you inspect the index and can make your output less readable. Fortunately, pandas offers a straightforward way to remove them, which keeps the index tidy and can sometimes reduce its memory footprint.

Understanding MultiIndex

Before diving into the specifics of dropping unused levels, let’s first understand what a MultiIndex is. A MultiIndex, simply put, is an index structure that allows you to have multiple levels or dimensions. This is particularly useful when working with data that has more than one hierarchical level.

import pandas as pd
import numpy as np

# Sample data creation
arrays = [["one", "one", "two", "two"], ["A", "B", "A", "B"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(4), index=index)
print(s)

This creates a Series with a MultiIndex, where ‘one’ and ‘two’ are the top-level indices, and ‘A’ and ‘B’ are the second-level indices. The `print(s)` command gives us the structure of our hierarchical index.
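To make the structure more concrete, here is a minimal sketch that rebuilds the same index and looks at its two internal pieces: `levels` (the distinct labels of each level) and `codes` (integer positions pointing into those labels). The variable names `level_labels` and `level_codes` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same index as in the example above, built directly from tuples
index = pd.MultiIndex.from_tuples(
    [("one", "A"), ("one", "B"), ("two", "A"), ("two", "B")],
    names=["first", "second"],
)
s = pd.Series(np.random.randn(4), index=index)

# Each level stores its distinct labels once...
level_labels = [level.tolist() for level in index.levels]
# ...and the codes record, row by row, which label applies
level_codes = [codes.tolist() for codes in index.codes]

print(level_labels)  # [['one', 'two'], ['A', 'B']]
print(level_codes)   # [[0, 0, 1, 1], [0, 1, 0, 1]]
```

Because the labels are stored separately from the codes, a label can stay in `levels` even when no code refers to it any more. That is how unused levels arise.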

Identifying Unused Levels

After performing operations like slicing or filtering, some of the labels stored in the levels of your MultiIndex may no longer correspond to any data points. pandas keeps them in `index.levels` anyway, and these leftover entries are what we refer to as ‘unused’ levels.

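# Let's say we keep only 'one' at the top level.
# Passing a list (["one"]) preserves the MultiIndex; a scalar ("one") would drop the first level.
s_filtered = s.loc[["one"]]
print(s_filtered.index.levels)

This code snippet filters the Series to keep rows where the top-level index is ‘one’. Note that `s.loc["one"]` with a scalar key would return a Series with a plain Index, which has no `levels` attribute, so we select with a list instead. If we inspect the index levels, we will see that ‘two’ is still listed in the first level, even though no remaining row uses it. The second level (‘A’ and ‘B’) is fully in use.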
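One way to check for unused levels programmatically is to compare each level's stored labels with the labels that actually appear in the rows. The sketch below is a hypothetical helper (`unused_labels` is not part of pandas) and filters with a boolean mask, which also keeps the original levels intact.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["one", "two"], ["A", "B"]], names=["first", "second"]
)
s = pd.Series(np.random.randn(4), index=index)

# A boolean mask keeps the MultiIndex and its original levels
s_one = s[s.index.get_level_values("first") == "one"]


def unused_labels(mi):
    """Hypothetical helper: map each level name to the labels no row uses."""
    return {
        name: level.difference(mi.get_level_values(i)).tolist()
        for i, (name, level) in enumerate(zip(mi.names, mi.levels))
    }


unused = unused_labels(s_one.index)
print(unused)  # {'first': ['two'], 'second': []}
```

An empty list for every level means there is nothing to clean up.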

How to Drop Unused Levels

The process of removing these unused levels is remarkably simple in pandas. The method we are looking for is `remove_unused_levels()`.

# Removing unused levels
s_filtered.index = s_filtered.index.remove_unused_levels()
print(s_filtered.index.levels)

Because `remove_unused_levels()` returns a new MultiIndex rather than modifying the existing one, we assign the result back to the `index` attribute. After that, the output will show that our MultiIndex now only includes the entries that are actually in use. This action can make the DataFrame or Series easier to work with and can sometimes even reduce memory consumption.
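As a quick sanity check, the following sketch compares the index before and after the call. It assumes the same two-level Series as above; `old_index` and `new_index` are illustrative names. Only the stored levels should change, while the visible index entries and their order stay the same.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["one", "two"], ["A", "B"]], names=["first", "second"]
)
s = pd.Series(np.random.randn(4), index=index)
s_one = s[s.index.get_level_values("first") == "one"]

old_index = s_one.index
new_index = old_index.remove_unused_levels()  # returns a new MultiIndex

print(old_index.levels[0].tolist())        # ['one', 'two']  (original is untouched)
print(new_index.levels[0].tolist())        # ['one']
print(list(new_index) == list(old_index))  # True: same tuples, same order
print(new_index.equals(old_index))         # True
```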

Advanced Example

Let’s take a slightly more complex example involving a DataFrame with multiple columns, and see how removing unused levels can be beneficial after filtering data.

df = pd.DataFrame({'A': ["one", "one", "two", "three"], 'B': ["A", "B", "A", "B"], 'values': np.random.rand(4)})
df.set_index(['A', 'B'], inplace=True)

# Filtering rows by their label in level 'A'
sliced_df = df.loc[df.index.get_level_values("A").isin(["one", "two"])]

print("Before:", sliced_df.index.levels, sep='\n')

# Dropping Unused Levels
sliced_df.index = sliced_df.index.remove_unused_levels()

print("After:", sliced_df.index.levels, sep='\n')

In this example, we start with a DataFrame indexed by ‘A’ and ‘B’. After filtering to keep only the entries where ‘A’ is either ‘one’ or ‘two’, the “Before” output shows that ‘three’ is still listed in level ‘A’ of the index, even though no remaining row uses it. After applying `remove_unused_levels()`, level ‘A’ contains only ‘one’ and ‘two’, streamlining our data structure.
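If you prefer not to assign to `.index` on an existing object, the same cleanup can be written as a step in a method chain. The sketch below uses a hypothetical helper, `drop_unused_levels`, built on `set_axis` and `pipe`; it is one possible arrangement, not a built-in pandas function.

```python
import numpy as np
import pandas as pd


def drop_unused_levels(obj):
    """Hypothetical helper: return obj with unused entries removed from its row MultiIndex."""
    if not isinstance(obj.index, pd.MultiIndex):
        return obj  # nothing to do for a flat index
    return obj.set_axis(obj.index.remove_unused_levels(), axis=0)


df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"],
        "B": ["A", "B", "A", "B"],
        "values": np.random.rand(4),
    }
).set_index(["A", "B"])

mask = df.index.get_level_values("A").isin(["one", "two"])
cleaned = df.loc[mask].pipe(drop_unused_levels)

print(cleaned.index.levels[0].tolist())  # ['one', 'two']
print(df.index.levels[0].tolist())       # ['one', 'three', 'two']  (original unchanged)
```

The helper returns a new object, so the original DataFrame keeps its full set of levels.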

Conclusion

Dropping unused levels in a MultiIndex can simplify the structure of your DataFrame or Series, making it more readable and potentially improving performance. Whether you’re dealing with simple or complex hierarchical data structures, calling `remove_unused_levels()` after filtering keeps your indices consistent with the data they describe.
