Pandas: Converting a nested dictionary to a multi-index DataFrame

Updated: February 21, 2024 By: Guest Contributor

Introduction

Pandas is a staple tool for data scientists and analysts who need to manipulate and analyze data. A common task is converting nested dictionaries into a multi-index DataFrame, which lays the data out in a tabular format that is easier to analyze and visualize. This article walks through that conversion step by step, with incremental code examples.

Nested dictionaries are a convenient way to store hierarchical data. Converting such a structure into a multi-index DataFrame makes it more readable and simplifies later manipulation. Let’s start with the basics and work up to more complex scenarios.

Basic Conversion

Consider the simplest form of a nested dictionary:

{
    'Fruit': {
        'Apple': 10,
        'Banana': 20
    },
    'Vegetable': {
        'Carrot': 15,
        'Potato': 25
    }
}

To convert this into a DataFrame:

import pandas as pd

data = {
    'Fruit': {
        'Apple': 10,
        'Banana': 20
    },
    'Vegetable': {
        'Carrot': 15,
        'Potato': 25
    }
}

df = pd.DataFrame(data)

print(df)

Output:

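        Fruit  Vegetable
Apple    10.0        NaN
Banana   20.0        NaN
Carrot    NaN       15.0
Potato    NaN       25.0

Here, the outer keys become the column labels and the inner keys become the row labels. NaN marks combinations that do not exist in the dictionary (there is no 'Vegetable' entry for 'Apple', for example). Because NaN is a floating-point value, the integers in those columns are shown as floats. Note that this result still has an ordinary single-level index, not a multi-index.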
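To turn this two-level dictionary into a structure with an actual two-level index, one option is to unstack the frame into a long Series and drop the missing combinations. This is a minimal sketch; the level names 'Group' and 'Item' are assumptions chosen for illustration and do not come from the data.

```python
import pandas as pd

data = {
    'Fruit': {'Apple': 10, 'Banana': 20},
    'Vegetable': {'Carrot': 15, 'Potato': 25},
}

df = pd.DataFrame(data)

# unstack() on a frame with a single-level index returns a Series
# indexed by (column label, row label) pairs, i.e. a two-level MultiIndex
long = df.unstack().dropna().astype(int)

# Illustrative level names (not part of the original data)
long.index.names = ['Group', 'Item']
print(long)
```

Dropping the NaN rows removes pairs such as ('Fruit', 'Carrot') that never existed in the dictionary, and casting back to int undoes the float conversion that NaN caused.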

Going Deeper: Multi-Level Index

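A dictionary nested more than two levels deep needs a different approach, because pd.DataFrame only unpacks the two outermost levels and leaves any deeper dictionaries as cell values. Consider the following: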

{
    'Food': {
        'Fruit': {
            'Apple': 10,
            'Banana': 20
        },
        'Vegetable': {
            'Carrot': 15,
            'Potato': 25
        }
    },
    'Beverage': {
        'Cold': {
            'Soda': 5,
            'Water': 2
        },
        'Hot': {
            'Coffee': 3,
            'Tea': 4
        }
    }
}

Converting this structure:

import pandas as pd

data = {
    'Food': {
        'Fruit': {
            'Apple': 10,
            'Banana': 20
        },
        'Vegetable': {
            'Carrot': 15,
            'Potato': 25
        }
    },
    'Beverage': {
        'Cold': {
            'Soda': 5,
            'Water': 2
        },
        'Hot': {
            'Coffee': 3,
            'Tea': 4
        }
    }
}

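# pd.DataFrame(data) alone would leave the innermost dictionaries as cell values.
# Flatten each path of keys into a tuple instead; pandas builds a MultiIndex from tuple keys.
flat = {
    (category, kind, item): quantity
    for category, kinds in data.items()
    for kind, items in kinds.items()
    for item, quantity in items.items()
}

df = pd.Series(flat, name='Quantity').to_frame()
df.index.names = ['Category', 'Type', 'Item']
print(df)

This produces a DataFrame whose rows carry a three-level MultiIndex (Category, Type, Item) and whose single Quantity column holds the values.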
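When the depth of the dictionary is not known in advance, one option is a small recursive helper that walks the dictionary and yields each path of keys as a tuple, since pandas can build a MultiIndex from tuples. The sketch below assumes every value sits at the same depth; `flatten_paths` and the level names are hypothetical, chosen for illustration.

```python
import pandas as pd

def flatten_paths(nested, prefix=()):
    """Yield (key_path, value) pairs for every leaf of a nested dict.

    Hypothetical helper: assumes all leaves sit at the same depth.
    """
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Recurse into the sub-dictionary, carrying the path so far
            yield from flatten_paths(value, path)
        else:
            yield path, value

data = {
    'Food': {
        'Fruit': {'Apple': 10, 'Banana': 20},
        'Vegetable': {'Carrot': 15, 'Potato': 25},
    },
    'Beverage': {
        'Cold': {'Soda': 5, 'Water': 2},
        'Hot': {'Coffee': 3, 'Tea': 4},
    },
}

paths = dict(flatten_paths(data))

# Illustrative level names (not part of the original data)
index = pd.MultiIndex.from_tuples(
    list(paths.keys()), names=['Category', 'Type', 'Item']
)
df = pd.DataFrame({'Quantity': list(paths.values())}, index=index)
print(df)
```

Partial lookups then work level by level: `df.loc['Food']` returns every row under Food, and `df.xs('Hot', level='Type')` selects by an inner level.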

Advanced Manipulations

For more complex structures or further customization of the DataFrame, Pandas provides a variety of methods. Let’s say we want to add another level of index based on year:

import pandas as pd

data = {
    2020: {
        'Food': {
            'Fruit': {
                'Apple': 10,
                'Banana': 20
            },
            'Vegetable': {
                'Carrot': 15,
                'Potato': 25
            }
        },
        'Beverage': {
            'Cold': {
                'Soda': 5,
                'Water': 2
            },
            'Hot': {
                'Coffee': 3,
                'Tea': 4
            }
        }
    },
    2021: {
        'Food': {
            'Fruit': {
                'Apple': 12,
                'Banana': 22
            },
            'Vegetable': {
                'Carrot': 18,
                'Potato': 30
            }
        },
        'Beverage': {
            'Cold': {
                'Soda': 6,
                'Water': 3
            },
            'Hot': {
                'Coffee': 4,
                'Tea': 5
            }
        }
    }
}

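# Convert to a multi-index DataFrame by flattening every
# (year, category, type, item) path into a tuple key
flat = {
    (year, category, kind, item): quantity
    for year, categories in data.items()
    for category, kinds in categories.items()
    for kind, items in kinds.items()
    for item, quantity in items.items()
}
multi_df = pd.Series(flat, name='Quantity').to_frame()
multi_df.index.names = ['Year', 'Category', 'Type', 'Item']
print(multi_df)

This creates a DataFrame with a four-level row index (Year, Category, Type, Item), which mirrors the hierarchy of the original dictionary.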
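With the year stored as an index level, it can be moved into the columns to compare years side by side. The following sketch uses a trimmed-down copy of the data and rebuilds the index from tuple keys; the level names and the 'Change' column are assumptions made for illustration.

```python
import pandas as pd

# A trimmed-down copy of the yearly data (values taken from the example above)
data = {
    2020: {
        'Food': {'Fruit': {'Apple': 10, 'Banana': 20}},
        'Beverage': {'Hot': {'Coffee': 3, 'Tea': 4}},
    },
    2021: {
        'Food': {'Fruit': {'Apple': 12, 'Banana': 22}},
        'Beverage': {'Hot': {'Coffee': 4, 'Tea': 5}},
    },
}

# Tuple keys become the levels of a MultiIndex
flat = {
    (year, category, kind, item): quantity
    for year, categories in data.items()
    for category, kinds in categories.items()
    for kind, items in kinds.items()
    for item, quantity in items.items()
}
quantities = pd.Series(flat, name='Quantity')
quantities.index.names = ['Year', 'Category', 'Type', 'Item']

# Move the Year level from the rows into the columns
by_year = quantities.unstack('Year')

# Year-over-year change per item (illustrative derived column)
by_year['Change'] = by_year[2021] - by_year[2020]
print(by_year)
```

Each row of `by_year` is now one (Category, Type, Item) combination, with one column per year. To pull out a single year without reshaping, `quantities.xs(2021, level='Year')` is an alternative.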

Conclusion

Converting nested dictionaries to multi-index DataFrames makes hierarchical data easier to manipulate, analyze, and visualize. Pandas makes these conversions straightforward, and once the data is indexed by its hierarchy, a wide range of further manipulations becomes available.