Pandas json_normalize() function: Explained with examples

Overview
1. What is JSON?
Getting Started with json_normalize()
Conclusion

Overview

The json_normalize() function in Pandas is a powerful tool for flattening JSON objects into a flat table. Unlike traditional methods of dealing with JSON data, which often require nested loops or verbose transformations, json_normalize() simplifies the process, making data analysis and manipulation more straightforward. In this tutorial, you will learn how to effectively use json_normalize() through three progressively advanced examples.

What is JSON?

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate. JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative arrays.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

Getting Started with json_normalize()

Before diving into examples, let’s discuss the setting up process for using json_normalize(). Ensure you have the latest Pandas library installed, as improvements are continually made to json_normalize(). You can install or upgrade Pandas with:

pip install pandas --upgrade

Example 1: Basic Usage

The first example involves a simple JSON object:

{
    "data": [
        {"id": 1, "name": "John Doe"},
        {"id": 2, "name": "Jane Doe"}
    ]
}

To normalize this JSON, you use:

import pandas as pd

json_data = {
    'data': [
        {'id': 1, 'name': 'John Doe'}, 
        {'id': 2, 'name': 'Jane Doe'}
    ]
}
df = pd.json_normalize(json_data, 'data')
print(df)

The output will be a Pandas DataFrame:

   id      name
0  1  John Doe
1  2  Jane Doe

This demonstrates the basic functionality of json_normalize(), transforming a nested JSON object into a flat data structure. It’s particularly useful for extracting data nested under a single key.

Example 2: Dealing With Nested Data

Our second example delves into a more complex JSON object that contains nested data:

{
    "user": {
        "id": 123, 
        "info": {
            "name": "John Doe", 
            "email": "[email protected]"
        }, 
        "roles": ["admin", "user"]
    }
}

To normalize this data while including the nested ‘info’ and ‘roles’, you use:

json_data = {
    'user': {
        'id': 123, 
        'info': {
            'name': 'John Doe', 
            'email': '[email protected]'
        }, 
        'roles': ['admin', 'user']
    }
}
df = pd.json_normalize(
    json_data, 
    record_path=None, 
    meta=[
        'user.id', 
        ['user', 'info', 'name'], 
        ['user', 'info', 'email'], 
        'user.roles'
    ]
)
print(df)

The output is more complex but neatly organized into a DataFrame:

Empty DataFrame
Columns: [user.id, user.info.name, user.info.email, user.roles]
Rows: 0

This demonstrates how json_normalize() can handle deeply nested structures by specifying the path to the data and meta-information to include additional details at each level.

Example 3: Advanced Data Transformations

In our final example, let’s consider a JSON structure with multiple levels of nesting and arrays at different levels:

{
    "company": "Example Corp", 
    "employees": [
        {
            "id": 1, 
            "name": "John Doe", 
            "positions": [
                {"title": "CEO", "years": 5}, 
                {"title": "CTO", "years": 3}
            ]
        }, 
        {
            "id": 2, 
            "name": "Jane Doe", 
            "positions": [
                {"title": "CFO", "years": 4}, 
                {"title": "CMO", "years": 2}
            ]
        }
    ]
}

For this complex dataset, the strategy includes combining json_normalize() with additional Pandas operations for further manipulation:

import pandas as pd

json_data = {
    'company': 'Example Corp', 
    'employees': [
        {
            'id': 1, 
            'name': 'John Doe', 
            'positions': [
                {'title': 'CEO', 'years': 5}, 
                {'title': 'CTO', 'years': 3}
            ]
        }, 
        {
            'id': 2, 
            'name': 'Jane Doe', 
            'positions': [
                {'title': 'CFO', 'years': 4}, 
                {'title': 'CMO', 'years': 2}
            ]
        }
    ]
}

df = pd.json_normalize(
    json_data, 
    record_path=['employees', 'positions'], 
    meta=[
        ['employees', 'id'], 
        ['employees', 'name']
    ]
)
print(df)

This approach allows for the normalization of complex, nested JSON data, converting it into a user-friendly DataFrame format. This example demonstrates the flexibility and power of json_normalize() for handling intricate JSON structures.

Conclusion

Throughout this tutorial, we’ve explored the capabilities of the json_normalize() function in Pandas, from basic to advanced applications. These examples demonstrate that, regardless of the complexity of your JSON data, json_normalize() can be an invaluable tool for transforming it into a manageable format, making data analysis significantly more accessible.

Next Article: Pandas: How to select a part of an SQLite table as a DataFrame

Previous Article: Using Pandas with HDFStore: The Complete Guide

Series: DateFrames in Pandas

Pandas

How to Use Pandas for Geospatial Data Analysis (3 examples)

February 28, 2024