MongoDB: 4 Ways to Remove Duplicate Documents

Updated: February 3, 2024 By: Guest Contributor

Introduction

MongoDB is a NoSQL database used by developers worldwide for its flexibility, scalability, and rich feature set. As a database grows, however, duplicate documents can creep in. This tutorial covers several ways to remove them, from simple shell queries to aggregation pipelines and a custom script.

Method 1: Using Distinct and Remove

This method lists the distinct values of a key field (or fields) and then, for each value, deletes every matching document except one.

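// List each distinct value of the 'name' field.
db.collection.distinct('name').forEach(function(name) {
  // Collect the _id of every document with this name except the first.
  var extraIds = db.collection.find({ name: name }, { _id: 1 })
    .sort({ _id: 1 })
    .skip(1)
    .toArray()
    .map(function(doc) { return doc._id; });

  // Remove the extra copies, leaving one document per name.
  if (extraIds.length) {
    db.collection.deleteMany({ _id: { $in: extraIds } });
  }
});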

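Output: For each value of the chosen key (here, 'name'), all but one matching document is removed, leaving one document per value. Because distinct returns every value in a single result, this approach suits fields with a modest number of distinct values.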
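
Before running a destructive delete, it can help to dry-run the "keep one, delete the rest" logic on a small sample. The sketch below is plain JavaScript with no database calls. The function name planDuplicateRemoval and the sample documents are hypothetical, and it assumes the first document seen for each key value is the one to keep.

```javascript
// Hypothetical helper: given an array of documents and a key field,
// return the _id values that would be deleted, keeping the first
// document seen for each distinct key value.
// Note: values are compared with Set semantics, so this sketch only
// handles primitive key values (strings, numbers), not nested objects.
function planDuplicateRemoval(docs, key) {
  const seen = new Set();
  const idsToDelete = [];
  for (const doc of docs) {
    const value = doc[key];
    if (seen.has(value)) {
      idsToDelete.push(doc._id); // later copy: mark for deletion
    } else {
      seen.add(value); // first copy: keep it
    }
  }
  return idsToDelete;
}

// Made-up sample data for illustration only.
const sample = [
  { _id: 1, name: 'Ann' },
  { _id: 2, name: 'Bob' },
  { _id: 3, name: 'Ann' },
  { _id: 4, name: 'Ann' },
];

console.log(planDuplicateRemoval(sample, 'name')); // [ 3, 4 ]
```

For a small collection, the sample array could be replaced with the result of db.collection.find({}, { name: 1 }).toArray() in the shell, so the list of ids can be reviewed before anything is deleted.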

Method 2: Using Aggregate to Find Duplicates

The aggregation framework can group documents by a field, count each group, and return only the groups that contain more than one document, which makes it well suited to identifying and deleting duplicates.

db.collection.aggregate([
    { $group: {
        _id: { myField: '$myField' },
        uniqueIds: { $addToSet: '$_id' },
        count: { $sum: 1 }
    }},
    { $match: {
        count: { $gt: 1 }
    }}
]).forEach(function(doc) {
    doc.uniqueIds.pop(); // Keep one document and prepare others for deletion
    db.collection.deleteMany({ _id: { $in: doc.uniqueIds } }); // remove() is deprecated in mongosh
});

Output: This operation identifies duplicates based on ‘myField’, keeps one copy, and removes the rest.
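
Duplicates are often defined by more than one field, and it usually matters which copy survives. The following sketch builds a similar pipeline for an arbitrary list of fields and sorts by _id first so that, with default ObjectId values, the earliest-inserted document comes first in each group. The names buildDuplicatePipeline and idsToDelete and the fields email and name are hypothetical, and the assumption that $push keeps the incoming sort order is worth verifying against your server version.

```javascript
// Hypothetical helper: build an aggregation pipeline that groups documents
// by one or more top-level fields and keeps only groups with duplicates.
// Dotted (nested) field names are not handled by this sketch.
function buildDuplicatePipeline(fields) {
  const groupKey = {};
  for (const field of fields) {
    groupKey[field] = '$' + field; // e.g. { email: '$email' }
  }
  return [
    { $sort: { _id: 1 } }, // oldest first, assuming default ObjectId _id values
    { $group: { _id: groupKey, ids: { $push: '$_id' }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } },
  ];
}

// Given one group result, return the ids to delete, keeping the first id.
function idsToDelete(group) {
  return group.ids.slice(1);
}

const pipeline = buildDuplicatePipeline(['email', 'name']);
console.log(JSON.stringify(pipeline[1].$group._id));
// {"email":"$email","name":"$name"}
```

In the shell, the result of buildDuplicatePipeline could be passed to db.collection.aggregate(), optionally with { allowDiskUse: true } for large collections, and each group's idsToDelete output passed to deleteMany.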

Method 3: Using MongoDB Compass

MongoDB Compass is the GUI for MongoDB that can be used to identify and remove duplicates manually.

  1. Navigate to your collection
  2. Use the filter bar (for a known value) or an aggregation in the Aggregations tab to identify duplicates
  3. Select and delete the duplicate documents manually

This method is suitable for databases with a lower volume of data or for those who prefer a graphical interface.
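
The filter bar matches documents one at a time, so finding groups of duplicates generally calls for an aggregation. A pipeline along the following lines could be used to list the most-duplicated values for manual review. The field name name and the limit of 20 are illustrative assumptions, and this presumes your Compass version lets you enter a pipeline as text in the Aggregations tab.

```javascript
// Illustrative pipeline for manual review of duplicates.
// 'name' is a placeholder for whichever field defines a duplicate.
const reviewPipeline = [
  { $group: { _id: '$name', ids: { $push: '$_id' }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }, // only values that occur more than once
  { $sort: { count: -1 } },          // most-duplicated values first
  { $limit: 20 },                    // keep the list short enough to review by hand
];

console.log(reviewPipeline.length); // 4
```

Only the array itself would go into Compass; the const wrapper is there to make the snippet valid standalone JavaScript. Each result lists the _id values sharing a value, which can then be looked up and deleted by hand.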

Advanced Technique: Writing a Custom Script

If the above methods don’t fully meet your needs, you can write a custom script to identify and remove duplicates. This approach offers the most flexibility.

const { MongoClient } = require('mongodb');

const url = 'mongodb://localhost:27017';
const dbName = 'mydb';
const client = new MongoClient(url);

async function removeDuplicates() {
    try {
        await client.connect();
        console.log('Connected to server');
        const collection = client.db(dbName).collection('documents');

        // Example: identify duplicates based on 'name'
        const cursor = collection.aggregate([
            { $group: {
                _id: { name: '$name' },
                ids: { $push: '$_id' },
                count: { $sum: 1 }
            }},
            { $match: {
                count: { $gt: 1 }
            }}
        ]);

        // Collect every _id except one per group.
        const duplicates = [];
        for await (const doc of cursor) {
            doc.ids.pop(); // Keep one document
            duplicates.push(...doc.ids);
        }

        if (duplicates.length) {
            const result = await collection.deleteMany({ _id: { $in: duplicates } });
            console.log(result.deletedCount + ' duplicates were deleted.');
        } else {
            console.log('No duplicates found.');
        }
    } finally {
        // Close the connection even if a step above throws.
        await client.close();
    }
}

removeDuplicates().catch(console.error);

Output: The script deletes the duplicate documents it finds and prints how many were removed, or 'No duplicates found.' if there were none.
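
When a collection holds many duplicates, a single $in array can grow very large. One option is to delete in fixed-size batches. The sketch below assumes a collection object that exposes deleteMany and resolves to an object with deletedCount, as the Node.js driver does. The helpers chunk and deleteInBatches, the default batch size of 1000, and the in-memory stand-in used for the demonstration are all hypothetical.

```javascript
// Hypothetical helper: split an array into chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Delete the given ids in batches and return the total number deleted.
// `collection` is assumed to expose deleteMany(filter) returning
// a promise of { deletedCount }.
async function deleteInBatches(collection, ids, batchSize = 1000) {
  let deleted = 0;
  for (const batch of chunk(ids, batchSize)) {
    const result = await collection.deleteMany({ _id: { $in: batch } });
    deleted += result.deletedCount;
  }
  return deleted;
}

// In-memory stand-in for a collection, for demonstration only.
const fakeCollection = {
  calls: 0,
  async deleteMany(filter) {
    this.calls += 1;
    return { deletedCount: filter._id.$in.length };
  },
};

deleteInBatches(fakeCollection, [1, 2, 3, 4, 5], 2).then((total) => {
  console.log(total, fakeCollection.calls); // 5 3
});
```

In the script above, the single deleteMany call could be swapped for deleteInBatches(collection, duplicates) if the list of ids turns out to be large.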

Conclusion

Removing duplicate documents in MongoDB is crucial for maintaining the efficiency and accuracy of your database. By implementing one or more of the methods discussed in this tutorial, depending on your needs and database size, you can ensure that your data remains clean and reliable.