MongoDB $bucket aggregation stage: A practical guide

Updated: February 1, 2024 By: Guest Contributor

Introduction

The $bucket stage in MongoDB’s aggregation framework is a powerful tool for grouping input documents into buckets based on a specified expression or field. Think of $bucket as a way of creating histograms or categorizing data in ranges that you can define in order to better understand the distribution of your data.

In this guide, we’ll look at practical examples of using the $bucket stage in different scenarios. But first, let’s establish a basic understanding of what the $bucket stage is and its syntax.

Understanding $bucket

The $bucket stage groups incoming documents into buckets based on the value of a field or an expression, most often a numeric one. Each bucket that contains at least one document is represented as a separate document in the output of the stage.

Here is the basic syntax:

{
    $bucket: {
        groupBy: <expression>,
        boundaries: [<lowerbound1>, <lowerbound2>, ..., <lowerboundN>],
        default: <literal>,
        output: {<output1>: {<accumulator expression>}, ...}
    }
}

Parameters:

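  • groupBy: The expression that determines how the documents are grouped into buckets. Unless default is specified, every document's value must fall within one of the ranges defined by boundaries.
  • boundaries: An array of values that define the buckets. It needs at least two values, all of the same type, in ascending order. Each adjacent pair defines one bucket whose lower bound is inclusive and whose upper bound is exclusive, so [0, 1000, 5000] defines the ranges [0, 1000) and [1000, 5000).
  • default: Optional. A literal used as the _id of an extra bucket that collects every document whose groupBy value falls outside the boundaries. Without it, such a document causes the stage to fail with an error.
  • output: Optional. A document that specifies the fields to include in each bucket's output document, each defined by an accumulator expression such as $sum or $push. If omitted, each bucket document contains only a count field.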
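The way a value is matched to a bucket can be sketched in plain JavaScript. This is only an illustration of the rule, not MongoDB's implementation, and findBucketId is a hypothetical helper name: a value belongs to the bucket whose lower bound it is greater than or equal to and whose upper bound it is strictly less than, and the bucket's _id is that lower bound.

```javascript
// Illustrative sketch only: mimics how $bucket chooses a bucket for one value.
// findBucketId is a hypothetical helper, not part of MongoDB.
function findBucketId(value, boundaries, defaultId) {
  for (let i = 0; i < boundaries.length - 1; i++) {
    // Lower bound is inclusive, upper bound is exclusive.
    if (value >= boundaries[i] && value < boundaries[i + 1]) {
      return boundaries[i]; // the bucket's _id is its lower bound
    }
  }
  // No range matched: fall back to the default bucket, or fail without one.
  if (defaultId === undefined) {
    throw new Error(`value ${value} is outside the boundaries and no default is set`);
  }
  return defaultId;
}

const exampleBoundaries = [0, 1000, 5000, 10000];
console.log(findBucketId(999, exampleBoundaries, "Other"));   // 0
console.log(findBucketId(1000, exampleBoundaries, "Other"));  // 1000
console.log(findBucketId(10000, exampleBoundaries, "Other")); // Other
```

Note that 10000 itself goes to the default bucket, because the last boundary is only an exclusive upper bound.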

Prerequisite: Sample Data

Before we dive into examples, let’s assume we have a collection named transactions with the following documents:

[{
  _id: 1, 
  amount: 700,
  transaction_type: "deposit"
},
{
  _id: 2,
  amount: 8000,
  transaction_type: "withdrawal"
},
{
  _id: 3,
  amount: 1500,
  transaction_type: "deposit"
}]

Basic $bucket Example

Suppose we want to group our transactions into buckets based on the amount field. We want to define three buckets: amounts from 0 up to (but not including) 1000, amounts from 1000 up to 5000, and amounts from 5000 up to 10000. Anything outside these ranges goes into the default bucket, "Other".

db.transactions.aggregate([
    {
        $bucket: {
            groupBy: "$amount", 
            boundaries: [0, 1000, 5000, 10000],
            default: "Other", 
            output: {
                "count": {$sum: 1},
                "transactions": {$push: "$$ROOT"}
            }
        }
    }
])

This aggregation query returns one document per non-empty bucket. Each document's _id is the bucket's lower boundary (or the default value), and it includes the count of transactions and the transactions themselves within that bucket.

Complex $bucket Example with Expressions

Now, let’s say we want to bucket the transactions not just by amount but also by transaction type, so that deposits and withdrawals land in different buckets. To do this, we can use an expression in the groupBy field.

db.transactions.aggregate([
    {
        $bucket: {
            groupBy: {
                $multiply: [
                    {$cond: {if: {$eq: ["$transaction_type", "deposit"]}, then: 1, else: -1}},
                    "$amount"
                ]
            },
            boundaries: [-10000, 0, 10000],
            default: "Other",
            output: {
                "count": {$sum: 1},
                "transactions": {$push: "$$ROOT"}
            }
        }
    }
])

The $multiply here is used to differentiate deposits from withdrawals by inverting the sign of the amounts for withdrawals. This example assumes withdrawals are represented as positive numbers just like deposits, and we need to convert them for the bucketing process.
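To see what that groupBy expression evaluates to, here is a plain-JavaScript mirror of it applied to the sample documents. It is a sketch for illustration only; signedAmount is a hypothetical name and nothing here runs inside MongoDB.

```javascript
// Plain-JavaScript mirror of the groupBy expression (illustration only).
const transactions = [
  { _id: 1, amount: 700, transaction_type: "deposit" },
  { _id: 2, amount: 8000, transaction_type: "withdrawal" },
  { _id: 3, amount: 1500, transaction_type: "deposit" },
];

// Equivalent of $multiply: [ $cond(deposit ? 1 : -1), "$amount" ].
// signedAmount is a hypothetical helper name.
function signedAmount(doc) {
  const sign = doc.transaction_type === "deposit" ? 1 : -1;
  return sign * doc.amount;
}

const keys = transactions.map(signedAmount);
console.log(keys); // [ 700, -8000, 1500 ]
```

With boundaries [-10000, 0, 10000], the key -8000 falls in the bucket whose _id is -10000 (withdrawals), while 700 and 1500 fall in the bucket whose _id is 0 (deposits).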

Handling Outliers with default

If some transactions have amounts that fall outside our predefined boundaries, we need the default option to catch them; without it, $bucket fails with an error when it meets such a document. For instance:

db.transactions.aggregate([
    {
        $bucket: {
            groupBy: "$amount",
            boundaries: [1000, 5000, 10000],
            // No output is given, so each bucket document carries only a count.
            default: "Less than 1000 or More than 10000"
        }
    }
])

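Any transactions with amounts less than 1000, or equal to or greater than 10000, are grouped into the “Less than 1000 or More than 10000” bucket. An amount of exactly 10000 also lands there, because the last boundary is an exclusive upper bound.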

Summarizing Multiple Fields with output

In addition to counting the transactions in each bucket, you can use the output field to provide more detailed summaries. Here’s an example that calculates both the average and total amount per bucket:

db.transactions.aggregate([
    {
        $bucket: {
            groupBy: "$amount",
            boundaries: [0, 1000, 5000, 10000],
            default: "Other",
            output: {
                "count": {$sum: 1},
                "averageAmount": {$avg: "$amount"},
                "totalAmount": {$sum: "$amount"}
            }
        }
    }
])
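As a rough check of what these accumulators compute, the sketch below reproduces the count, total, and average per bucket in plain JavaScript. It is an illustration under stated assumptions, not MongoDB code: lowerBoundFor is a hypothetical helper, and the fourth amount (300) is a made-up extra transaction added to the three sample amounts so that one bucket holds more than one value.

```javascript
// Illustration only: plain-JavaScript equivalent of count / $sum / $avg per bucket.
// 700, 8000, 1500 are the sample amounts; 300 is a hypothetical extra transaction.
const amounts = [700, 8000, 1500, 300];
const boundaries = [0, 1000, 5000, 10000];

// Hypothetical helper: lower bound of the range [lower, upper) containing amount,
// or "Other" (the default bucket) when no range matches.
function lowerBoundFor(amount) {
  for (let i = 0; i < boundaries.length - 1; i++) {
    if (amount >= boundaries[i] && amount < boundaries[i + 1]) return boundaries[i];
  }
  return "Other";
}

const buckets = new Map();
for (const amount of amounts) {
  const id = lowerBoundFor(amount);
  const bucket = buckets.get(id) || { _id: id, count: 0, totalAmount: 0 };
  bucket.count += 1;            // like count: {$sum: 1}
  bucket.totalAmount += amount; // like totalAmount: {$sum: "$amount"}
  buckets.set(id, bucket);
}

// like averageAmount: {$avg: "$amount"}
const result = [...buckets.values()].map((b) => ({
  ...b,
  averageAmount: b.totalAmount / b.count,
}));
console.log(result);
```

Here the bucket with _id 0 ends up with a count of 2, a totalAmount of 1000, and an averageAmount of 500. The order of the entries in this sketch follows insertion order and is not meant to reflect the order MongoDB returns.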

As you work with the $bucket stage, keep in mind the following points:

  • Always define boundaries in ascending order without overlaps.
  • The default bucket can be used to capture any data points that do not fit in the predefined buckets.
  • The output field can be as simple or as complex as needed, allowing for a customized summarization of bucket contents.
  • The groupBy field can utilize complex expressions to allow for more intricate bucketing strategies.

By combining the capability to group by complex expressions, handle outliers, and summarize data within each group, the $bucket stage of the MongoDB aggregation pipeline is an indispensable tool for data analysis and the organization of large datasets. Its flexibility in managing numerical ranges and categorization makes it a valuable asset in any MongoDB user’s toolkit.

Conclusion

This guide walked through MongoDB’s $bucket aggregation stage: its syntax, a basic range-based example, bucketing on a computed expression, catching out-of-range values with default, and summarizing each bucket with output. Together, these features let you turn raw values into organized, digestible segments that are easier to analyze.

As you incorporate $bucket into your workflows, experiment with its many features and see how it can simplify complex data problems. Happy aggregating!