PyMongo: Grouping documents by day/month/year

Updated: February 8, 2024 By: Guest Contributor Post a comment

Overview

Understanding how to effectively group documents by day, month, or year with PyMongo can significantly enhance your ability to analyze time-series data or perform aggregations based on temporal attributes. In MongoDB, the aggregation framework provides a powerful way to process data and perform complex transformations. This tutorial will guide you through various examples, moving from basic techniques to more advanced approaches.

Prerequisites

Before diving into the examples, ensure you have:

  • MongoDB installed and running on your machine or server.
  • The PyMongo driver installed in your Python environment. You can install it using pip install pymongo.
  • A basic understanding of MongoDB and Python.

Basic Document Grouping by Day

To begin grouping documents by their dates, consider a collection named events that contains documents with an eventDate field. Here’s how to aggregate documents by the day:

from pymongo import MongoClient
from datetime import datetime

# Establish connection to the MongoDB server
client = MongoClient('mongodb://localhost:27017/')
db = client['your_db_name']
collection = db['events']

# Construct the pipeline
pipeline = [
    {
        '$match': {
            'eventDate': {
                '$gte': datetime(2023, 1, 1),
                '$lte': datetime(2023, 12, 31)
            }
        }
    },
    {
        '$group': {
            '_id': {
                '$dayOfMonth': '$eventDate'
            },
            'count': {
                '$sum': 1
            }
        }
    },
    {
        '$sort': {
            '_id': 1
        }
    }
]

result = collection.aggregate(pipeline)

for doc in result:
    print(doc)

This simple aggregation pipeline filters documents by eventDate within a specific range and groups them by day of the month, counting the number of documents for each day. The results are then sorted by the day.

Grouping by Month

To group documents by month, you adjust the _id field in the $group stage to use $month instead:

pipeline = [
    {
        '$match': {
            'eventDate': {
                '$gte': datetime(2023, 1, 1),
                '$lte': datetime(2023, 12, 31)
            }
        }
    },
    {
        '$group': {
            '_id': {
                '$month': '$eventDate'
            },
            'count': {
                '$sum': 1
            }
        }
    },
    {
        '$sort': {
            '_id': 1
        }
    }
]

result = collection.aggregate(pipeline)

for doc in result:
    print(doc)

This modification groups the documents by month, still using the eventDate field to determine the grouping criterion.

Grouping by Year

In this example, the events are grouped by year only, ignoring the month and day. This will provide a count of events for each year within the specified date range.

pipeline = [
    {
        '$match': {
            'eventDate': {
                '$gte': datetime.now().replace(year=datetime.now().year - 1),
                '$lte': datetime.now()
            }
        }
    },
    {
        '$group': {
            '_id': {
                'year': {'$year': '$eventDate'}
            },
            'count': {'$sum': 1}
        }
    },
    {
        '$sort': {
            '_id.year': 1
        }
    }
]

result = collection.aggregate(pipeline)

for doc in result:
    print(doc)

Advanced Grouping: Year, Month, and Day

For more detailed analysis, you might want to group documents not just by day or month, but by both, alongside their year. To accomplish this, you combine multiple date operators in the _id field during the $group stage:

pipeline = [
    {
        '$match': {
            'eventDate': {
                '$gte': datetime.now().replace(year=datetime.now().year - 1),
                '$lte': datetime.now()
            }
        }
    },
    {
        '$group': {
            '_id': {
                'year': {'$year': '$eventDate'},
                'month': {'$month': '$eventDate'},
                'day': {'$dayOfMonth': '$eventDate'}
            },
            'count': {'$sum': 1}
        }
    },
    {
        '$sort': {
            '_id.year': 1,
            '_id.month': 1,
            '_id.day': 1
        }
    }
]

result = collection.aggregate(pipeline)

for doc in result:
    print(doc)

This pipeline filters documents within the last year from the current date and groups them by year, month, and day. It provides a fine-grained view of the document distribution over time.

Conclusion

Grouping documents by day, month, or year with PyMongo is a powerful technique to summarize and analyze time-based data. By carefully constructing aggregation pipelines, you can derive meaningful insights from your MongoDB collections. Always consider your specific data schema and analysis needs when building these pipelines.