PyMongo: How to use the aggregation pipeline (sum, avg, count, min, max)

Overview
Getting Started
Basic Aggregation Operations
Advanced Aggregation Operations
1. Conditional Summation and Averages
2. Working with Arrays and Subdocuments
Conclusion

Overview

Aggregation in MongoDB is a powerful feature for performing complex data analysis directly on the database. PyMongo, the Python driver for MongoDB, provides robust support for the MongoDB aggregation pipeline, enabling developers to execute a series of operations and achieve data transformation and summary. This tutorial will guide you through using the aggregation pipeline in PyMongo, focusing on operations such as sum, avg, count, min, and max. We will start with basic examples and move on to more advanced use cases.

Getting Started

Before diving into the aggregation pipeline, ensure your MongoDB server is running, and you’ve installed PyMongo in your Python environment:

pip install pymongo

Connect to your MongoDB database:

from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['your_database']
collection = db['your_collection']

Basic Aggregation Operations

Aggregation operations can be quite useful for summarizing, analyzing, or transforming the data stored in your MongoDB collections. Let’s start with the basics:

Counting Documents

To count the number of documents in a collection, we can use the count_documents method, which is straightforward. However, to demonstrate the use of the aggregation pipeline for counting, we’ll perform a count operation within it:

pipeline = [
    {'$match': {}},  # Match all documents
    {'$count': 'total_documents'}
]
results = collection.aggregate(pipeline)
print(list(results))

This will return the total number of documents in the collection. It’s a simple way to get started with aggregation.

Calculating Sum

For summing values of a specified field across documents, we use the $group stage with the $sum operator:

pipeline = [
    {'$group': {
        '_id': null,  # Group all documents together
        'total_sales': {'$sum': '$sales'}
    }}
]
results = collection.aggregate(pipeline)
print(list(results))

This pipeline adds up the sales field value of each document in the collection and outputs the total sales.

Calculating Average

We calculate averages using the $group stage with the $avg operator, similar to how we calculated sums:

pipeline = [
    {'$group': {
        '_id': null,
        'average_price': {'$avg': '$price'}
    }}
]
results = collection.aggregate(pipeline)
print(list(results))

This will calculate and return the average price of all documents where the price field exists.

Finding Minimum and Maximum Values

To find the minimum or maximum value of a field across all documents, use the $min and $max operators:

pipeline = [
    {'$group': {
        '_id': null,
        'min_price': {'$min': '$price'},
        'max_price': {'$max': '$price'}
    }}
]
results = collection.aggregate(pipeline)
print(list(results))

This pipeline will return the minimum and maximum price in the collection.

Advanced Aggregation Operations

As you become more familiar with the basics, you can start exploring more complex aggregation operations. This involves combining multiple stages, using conditional operations, and working with arrays and subdocuments.

Conditional Summation and Averages

Conditional aggregation allows for more precise control over which documents or parts of documents contribute to the result:

pipeline = [
    {'$group': {
        '_id': '$category',
        'total_sales': {
            '$sum': {
                '$cond': [{'$gte': ['$price', 100]}, '$sales', 0]
            }
        },
        'average_price': {
            '$avg': {
                '$cond': [{'$gt': ['$price', 0]}, '$price', 'NONE']
            }
        }
    }}
]
results = collection.aggregate(pipeline)
print(list(results))

This pipeline calculates the total sales and average price, but only for items where the price is >= 100 for total sales, and > 0 for average price, shedding light on higher-value transactions.

Working with Arrays and Subdocuments

Aggregating data that includes arrays or subdocuments requires a slightly different approach. Using operators like $unwind for arrays and $project for reshaping can be very useful:

pipeline = [
    {'$unwind': '$tags'},  # Unwind the arrays
    {'$group': {
        '_id': '$tags',
        'count': {'$sum': 1}
    }}
]
results = collection.aggregate(pipeline)
print(list(results))

This pipeline counts the occurrence of each distinct tag found in the documents’ tags array.

Conclusion

The MongoDB aggregation pipeline is a potent tool for data analysis and transformation, and PyMongo makes it accessible and useful within Python applications. By starting with basic operations and progressively tackling more complex tasks, developers can leverage this functionality to derive meaningful insights from their data collections.

Next Article: PyMongo: How to select/count distinct documents

Previous Article: How to translate MongoDB shell syntax to PyMongo code

Series: Data Persistence in Python – Tutorials & Examples

Python