Sling Academy

MongoDB: Counting distinct values in each group

Last updated: February 03, 2024

Introduction

Counting distinct values is a fundamental data aggregation task that can be complex depending on the structure of your data. MongoDB offers a flexible aggregation framework to carry out this operation effectively. In this tutorial, we’ll walk you through the steps of counting distinct values in each group using MongoDB.

Understanding $group and $addToSet

Before counting distinct values, let’s look at the ‘$group’ stage of the MongoDB aggregation pipeline. ‘$group’ groups input documents by a specified identifier expression (the ‘_id’ of each output document) and applies accumulator expressions to each group. ‘$addToSet’ is an accumulator operator that collects the unique values seen in a group into an array; the order of the elements in that array is unspecified.

db.collection.aggregate([
    { $group: {
        _id: '$fieldToGroupBy',
        uniqueValues: { $addToSet: '$fieldToGetDistinctValuesFrom' }
    }}
])

This pattern groups documents by ‘fieldToGroupBy’ and accumulates unique ‘fieldToGetDistinctValuesFrom’ values into an array for each group.

Counting Distinct Values with $size

Each group now carries an array of its unique values. To count them, we can use the ‘$size’ operator, which returns the number of elements in an array.

db.collection.aggregate([
    { $group: {
        _id: '$fieldToGroupBy',
        uniqueValues: { $addToSet: '$fieldToGetDistinctValuesFrom' }
    }},
    { $project: {
        countDistinct: { $size: '$uniqueValues' }
    }}
])

The ‘$project’ stage here computes the number of distinct values by getting the size of the ‘uniqueValues’ array.
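To make the two stages concrete, here is a minimal plain-JavaScript sketch of what they compute, run against a small in-memory array rather than a MongoDB collection. The sample documents, the field names dept and skill, and the helper countDistinctPerGroup are all invented for illustration; this is not how MongoDB implements the stages internally.

```javascript
// In-memory sketch of $group + $addToSet followed by $project + $size.
// Hypothetical sample data; nothing here talks to a database.
const docs = [
  { dept: "ops", skill: "bash" },
  { dept: "ops", skill: "bash" },   // duplicate value within the same group
  { dept: "ops", skill: "python" },
  { dept: "dev", skill: "python" },
];

function countDistinctPerGroup(documents, groupField, valueField) {
  // Rough equivalent of $group with $addToSet: one Set of values per group key.
  // Note: a JavaScript Set compares objects by reference, so this sketch
  // only behaves like $addToSet for primitive values (strings, numbers).
  const sets = new Map();
  for (const doc of documents) {
    const key = doc[groupField];
    if (!sets.has(key)) sets.set(key, new Set());
    sets.get(key).add(doc[valueField]);
  }
  // Rough equivalent of $project with $size: report how many values each Set holds.
  return [...sets].map(([key, values]) => ({ _id: key, countDistinct: values.size }));
}

const result = countDistinctPerGroup(docs, "dept", "skill");
console.log(result);
```

The duplicate "bash" entry is absorbed by the set, so the "ops" group reports two distinct values rather than three.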

Example: Group by Category

Consider a products collection where each document represents a product with a category and supplier. To count distinct suppliers for each category:

db.products.aggregate([
    { $group: {
        _id: '$category',
        distinctSuppliers: { $addToSet: '$supplier' }
    }},
    { $project: {
        numberOfSuppliers: { $size: '$distinctSuppliers' }
    }}
])

Output would look something like this:

[
    { _id: "Electronics", numberOfSuppliers: 5 },
    { _id: "Clothing", numberOfSuppliers: 3 },
    ...
]

Handling Large Sets and Performance

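‘$addToSet’ with ‘$size’ works well when each group has a modest number of distinct values. On large datasets it can become a problem because ‘$addToSet’ builds the full array of unique values for every group: the ‘$group’ stage is subject to a 100 MB memory limit unless it is allowed to spill to disk, and each resulting document must still fit within the 16 MB BSON document size limit. An alternative that avoids building these arrays is to use two consecutive ‘$group’ stages and count with the ‘$sum’ accumulator.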

db.collection.aggregate([
    { $group: {
        _id: {
            groupField: '$fieldToGroupBy',
            distinctField: '$fieldToGetDistinctValuesFrom'
        }
    }},
    { $group: {
        _id: '$_id.groupField',
        countDistinct: { $sum: 1 }
    }}
])

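The first ‘$group’ uses a compound key made of the group field and the value field, so it emits exactly one document per unique (group, value) pair. The second ‘$group’ regroups those documents by the group field alone and adds 1 for each, which yields the number of distinct values per group.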
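Since the two-stage pipeline has the same shape for any pair of fields, it can be generated by a small function. The sketch below is one way to do that; buildDistinctCountPipeline is a hypothetical helper name, and the field names passed to it (category and supplier) are only examples. It builds a plain array and does not contact a database.

```javascript
// Hypothetical helper: builds the two-stage distinct-count pipeline
// for any pair of field names.
function buildDistinctCountPipeline(groupField, distinctField) {
  return [
    // Stage 1: one output document per unique (group, value) pair.
    {
      $group: {
        _id: { groupField: `$${groupField}`, distinctField: `$${distinctField}` },
      },
    },
    // Stage 2: count how many unique pairs share the same group value.
    { $group: { _id: "$_id.groupField", countDistinct: { $sum: 1 } } },
  ];
}

const pipeline = buildDistinctCountPipeline("category", "supplier");
console.log(JSON.stringify(pipeline, null, 2));

// In mongosh, against a real collection (assumed here to be db.products):
// db.products.aggregate(pipeline, { allowDiskUse: true })
```

The allowDiskUse option in the commented-out call lets stages that exceed the memory limit write temporary files to disk; whether it is needed depends on the data size and server version.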

Indexing and Performance Optimization

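Indexes can help when counting distinct values, but mainly indirectly. A ‘$group’ stage on its own generally has to process every input document, so the biggest gains usually come from an index that supports a ‘$match’ or ‘$sort’ stage placed before ‘$group’. A compound index on the grouping field and the value field is a reasonable starting point: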

db.collection.createIndex({ fieldToGroupBy: 1, fieldToGetDistinctValuesFrom: 1 })

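Creating the index does not by itself guarantee a faster aggregation. Check the query plan with db.collection.explain() to confirm whether the index is actually used, and filter early with ‘$match’ where the use case allows it so that fewer documents reach the ‘$group’ stage.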

Use Case: Count Distinct Status Codes

Let’s consider a more concrete example: a logs collection with fields for userId and statusCode. We want to count how many distinct status codes each user has encountered.

db.logs.aggregate([
    { $group: {
        _id: {
            user: '$userId',
            code: '$statusCode'
        }
    }},
    { $group: {
        _id: '$_id.user',
        uniqueStatusCodes: { $sum: 1 }
    }}
])

Results would display the count of distinct status codes experienced by each user.
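To check the expected result by hand, the same two-pass logic can be traced over a few log entries in plain JavaScript. The log entries below are invented sample data, and the code is an in-memory illustration of what the pipeline computes, not a MongoDB call.

```javascript
// Hypothetical log entries; userId and statusCode match the field names above.
const logs = [
  { userId: "u1", statusCode: 200 },
  { userId: "u1", statusCode: 200 },
  { userId: "u1", statusCode: 404 },
  { userId: "u2", statusCode: 500 },
  { userId: "u1", statusCode: 500 },
];

// Pass 1 (first $group): keep each (user, code) pair exactly once.
const pairs = new Map();
for (const { userId, statusCode } of logs) {
  pairs.set(JSON.stringify([userId, statusCode]), { user: userId, code: statusCode });
}

// Pass 2 (second $group with $sum: 1): count the surviving pairs per user.
const counts = new Map();
for (const { user } of pairs.values()) {
  counts.set(user, (counts.get(user) || 0) + 1);
}

const perUser = [...counts].map(([user, n]) => ({ _id: user, uniqueStatusCodes: n }));
console.log(perUser);
```

User u1 has five log entries minus one repeat of 200, leaving three distinct codes (200, 404, 500), while u2 has one. MongoDB does not guarantee the order of ‘$group’ output, so add a ‘$sort’ stage if the order of the results matters.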

Advanced Techniques: Using $facet for Multiple Aggregations

When you need to perform multiple distinct counts in one query or gather additional statistics, ‘$facet’ provides a way to execute several aggregation pipelines within a single stage and return the results in a single document.

db.collection.aggregate([
    {
        $facet: {
            "statusCounts": [
                { $group: { _id: "$status", count: { $sum: 1 } } }
            ],
            "distinctUsersPerStatus": [
                { $group: { _id: "$status", users: { $addToSet: "$user" } } },
                { $project: { count: { $size: "$users" } } }
            ]
        }
    }
])

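This query returns a single document with two array fields: ‘statusCounts’ holds the number of documents for each status, and ‘distinctUsersPerStatus’ holds the number of distinct users for each status. Because everything is returned in one document, the combined output must fit within the 16 MB BSON document size limit.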
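Because the two facets arrive as separate arrays, client code often needs to join them by status. The sketch below shows one way to do that in plain JavaScript. The facetResult object is a hand-written, hypothetical example of the result shape (the status names and counts are made up), and the output field names documents and distinctUsers are also assumptions.

```javascript
// Hypothetical result, shaped like the single document $facet returns
// for the pipeline above. The values are invented for illustration.
const facetResult = {
  statusCounts: [
    { _id: "active", count: 4 },
    { _id: "inactive", count: 1 },
  ],
  distinctUsersPerStatus: [
    { _id: "active", count: 2 },
    { _id: "inactive", count: 1 },
  ],
};

// Merge the two facets into one row per status.
const byStatus = new Map();
for (const { _id, count } of facetResult.statusCounts) {
  byStatus.set(_id, { status: _id, documents: count, distinctUsers: 0 });
}
for (const { _id, count } of facetResult.distinctUsersPerStatus) {
  const row = byStatus.get(_id) || { status: _id, documents: 0, distinctUsers: 0 };
  row.distinctUsers = count;
  byStatus.set(_id, row);
}

const merged = [...byStatus.values()];
console.log(merged);
```

In mongosh, the real result document could be obtained by calling toArray() on the cursor returned by aggregate() and taking its first element.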

Conclusion

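Counting distinct values in each group is a common data analysis task in MongoDB. For small groups, ‘$addToSet’ followed by ‘$size’ is the most direct approach; for large datasets, two consecutive ‘$group’ stages avoid building large arrays in memory. Indexes help most when they support a ‘$match’ or ‘$sort’ stage ahead of the grouping, so verify their effect with explain() rather than assuming it.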
