MongoDB: 4 Ways to Remove Duplicate Documents

Last updated: February 03, 2024

Introduction

MongoDB is a flexible, scalable NoSQL database, but duplicate documents can creep into a collection as it grows. This tutorial covers several ways to remove duplicate documents in MongoDB, ranging from simple shell queries to aggregation pipelines and a standalone script.

Method 1: Using Distinct and Remove

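This method lists the distinct values of one or more key fields, then deletes every document except one for each value. The example uses deleteMany() because the remove() helper is deprecated in current versions of the shell.

// List every distinct value of the 'name' field.
const names = db.collection.distinct('name');

// For each value, keep the document with the lowest _id and delete the rest.
names.forEach(function (name) {
  const ids = db.collection
    .find({ name: name }, { _id: 1 })
    .sort({ _id: 1 })
    .toArray()
    .map(function (doc) { return doc._id; });
  ids.shift(); // Drop the first _id from the list so that document is kept
  if (ids.length) {
    db.collection.deleteMany({ _id: { $in: ids } });
  }
});

Output: For each distinct 'name' value, the document with the lowest _id is kept and every other document with that value is removed.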
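Before deleting anything, it can help to check the "keep one per key" logic on a small sample in memory. The following is a minimal sketch, not part of the MongoDB API: the helper name idsToDelete and the sample data are made up for illustration, and it assumes the key values are plain strings, numbers, or null.

```javascript
// Hypothetical helper: given documents and a key, return the _id values
// of every document after the first one seen for each key value.
function idsToDelete(docs, key) {
  const seen = new Set();
  const extras = [];
  for (const doc of docs) {
    // JSON.stringify lets key values be compared by value.
    // Documents that lack the key all share the value `undefined` here.
    const value = JSON.stringify(doc[key]);
    if (seen.has(value)) {
      extras.push(doc._id); // a later copy: mark for deletion
    } else {
      seen.add(value); // first copy: keep it
    }
  }
  return extras;
}

// Made-up sample data for illustration.
const sample = [
  { _id: 1, name: 'Ann' },
  { _id: 2, name: 'Bob' },
  { _id: 3, name: 'Ann' },
  { _id: 4, name: 'Ann' },
];
console.log(idsToDelete(sample, 'name')); // [ 3, 4 ]
```

In the shell, the same function could in principle be given the result of a find().toArray() call on a small collection, and its return value inspected before it is passed to a delete.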

Method 2: Using Aggregate to Find Duplicates

The aggregation framework provides a powerful way to process data and can be used to identify and delete duplicates.

db.collection.aggregate([
    // Group documents that share the same 'myField' value.
    { $group: {
        _id: { myField: '$myField' },
        uniqueIds: { $addToSet: '$_id' },
        count: { $sum: 1 }
    }},
    // Keep only the groups that contain more than one document.
    { $match: {
        count: { $gt: 1 }
    }}
]).forEach(function(doc) {
    doc.uniqueIds.pop(); // Keep one document and prepare others for deletion
    db.collection.deleteMany({ _id: { $in: doc.uniqueIds } });
});

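Output: This operation identifies duplicates based on ‘myField’, keeps one copy from each group, and removes the rest. Because $addToSet does not guarantee the order of its output array, which copy survives is arbitrary.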
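Duplicates are often defined by more than one field. As a rough sketch, the two stages above can be generated for any list of fields. The helper name buildDuplicatePipeline and the 'email' field are assumptions for illustration, and the sketch only handles top-level field names (no dotted paths).

```javascript
// Hypothetical helper: build a duplicate-finding pipeline for one or more
// top-level fields. Dotted paths such as 'address.city' are not handled.
function buildDuplicatePipeline(fields) {
  const groupKey = {};
  for (const field of fields) {
    groupKey[field] = '$' + field; // e.g. { name: '$name' }
  }
  return [
    // Group documents that share the same values for all listed fields.
    { $group: { _id: groupKey, uniqueIds: { $addToSet: '$_id' }, count: { $sum: 1 } } },
    // Keep only the groups with more than one document.
    { $match: { count: { $gt: 1 } } },
  ];
}

const pipeline = buildDuplicatePipeline(['name', 'email']);
console.log(JSON.stringify(pipeline[0].$group._id)); // {"name":"$name","email":"$email"}
```

The returned array could then be passed to db.collection.aggregate() in place of the hand-written stages.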

Method 3: Using MongoDB Compass

MongoDB Compass is the GUI for MongoDB that can be used to identify and remove duplicates manually.

  1. Open Compass and navigate to your collection
  2. Use the filter bar, or a $group stage in the Aggregations tab, to identify duplicates
  3. Select and delete the duplicate documents manually

This method is suitable for databases with a lower volume of data or for those who prefer a graphical interface.

Advanced Technique: Writing a Custom Script

If the above methods don’t fully meet your needs, you can write a custom script to identify and remove duplicates. This approach offers the most flexibility. The example below uses the MongoDB Node.js driver.

const { MongoClient } = require('mongodb');

const url = 'mongodb://localhost:27017';
const dbName = 'mydb';
const client = new MongoClient(url);

async function removeDuplicates() {
    try {
        await client.connect();
        console.log('Connected correctly to server');
        const collection = client.db(dbName).collection('documents');

        // Example: Identifying duplicates based on 'name'
        const cursor = collection.aggregate([
            { $group: {
                _id: { name: '$name' },
                ids: { $push: '$_id' },
                count: { $sum: 1 }
            }},
            { $match: {
                count: { $gt: 1 }
            }}
        ]);

        // Collect every _id except one from each group of duplicates.
        const duplicates = [];
        for await (const doc of cursor) {
            doc.ids.pop(); // Keep one document
            duplicates.push(...doc.ids);
        }

        if (duplicates.length) {
            const result = await collection.deleteMany({ _id: { $in: duplicates } });
            // Report the count confirmed by the server.
            console.log(result.deletedCount + ' duplicates were deleted.');
        } else {
            console.log('No duplicates found.');
        }
    } finally {
        // Close the connection even if a step above throws.
        await client.close();
    }
}

removeDuplicates().catch(console.error);

Output: The script deletes the identified duplicate documents and prints how many were removed, or reports that none were found.
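The script above sends every duplicate _id in a single $in array. On a very large collection that query document could approach MongoDB's 16 MB BSON document limit, so one option is to delete in batches. The helper below is a minimal sketch; the name chunk is an assumption for illustration, not a driver function.

```javascript
// Hypothetical helper: split an array into consecutive slices of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

console.log(chunk([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```

Inside removeDuplicates(), the single deleteMany call could then become a loop such as `for (const batch of chunk(duplicates, 1000)) { await collection.deleteMany({ _id: { $in: batch } }); }`, where the batch size of 1000 is an arbitrary starting point rather than a recommended value.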

Conclusion

Removing duplicate documents in MongoDB is crucial for maintaining the efficiency and accuracy of your database. By implementing one or more of the methods discussed in this tutorial, depending on your needs and database size, you can ensure that your data remains clean and reliable.
