MySQL 8: 3 ways to remove duplicate rows from a table

Updated: January 26, 2024 By: Guest Contributor

Introduction

Managing data in MySQL often involves ensuring data integrity and uniqueness. One frequent challenge faced by developers and database administrators is removing duplicate rows from a table. Fortunately, MySQL provides multiple ways to handle this situation efficiently. In this guide, we’ll explore a variety of strategies to remove duplicates from a MySQL table, providing you with the knowledge to choose the best approach for your specific scenario.

Solution 1: Distinct Keyword + Temporary Table

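Use the DISTINCT keyword to select the unique rows into a new working table, then drop the original table and rename the new table to take its place. The working table is a regular table used temporarily, not a MySQL TEMPORARY table. Note that SELECT DISTINCT * treats two rows as duplicates only when every column matches, so this approach suits tables that have no unique id column.

  • Create a working table with the distinct rows from the original table.
  • Drop the original table.
  • Rename the working table to the original table’s name.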

Example:

-- Create temporary table with unique records
CREATE TABLE mytable_temp AS
SELECT DISTINCT *
FROM mytable;

-- Remove original table
DROP TABLE mytable;

-- Rename temporary table to original
RENAME TABLE mytable_temp TO mytable;

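Notes: This method is simple and effective for smaller tables. For larger tables it can be slow, and it needs enough disk space to hold a second copy of the data. CREATE TABLE ... AS SELECT copies only the column data: primary keys, indexes, and AUTO_INCREMENT attributes are not carried over, and triggers and foreign key constraints are lost along with the original table, so you’ll need to recreate all of them manually after the rename. The table also does not exist between the DROP and the RENAME, so queries that arrive in that window will fail.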
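One way to soften some of these drawbacks is sketched below. It is a variation under stated assumptions, not a drop-in replacement: mytable_new and mytable_old are hypothetical names, and the table is assumed to have no generated columns (which INSERT ... SELECT * cannot populate). CREATE TABLE ... LIKE copies column definitions and indexes, and a multi-table RENAME TABLE swaps both names in one atomic statement.

```sql
-- Sketch only: mytable_new and mytable_old are hypothetical names.

-- Create an empty copy that keeps column definitions and indexes
CREATE TABLE mytable_new LIKE mytable;

-- Copy one instance of each distinct row
INSERT INTO mytable_new
SELECT DISTINCT *
FROM mytable;

-- Swap the two tables in a single atomic statement,
-- so mytable never disappears for other sessions
RENAME TABLE mytable TO mytable_old,
             mytable_new TO mytable;

-- Drop the old copy only after checking the result
-- DROP TABLE mytable_old;
```

CREATE TABLE ... LIKE does not copy foreign key definitions, and the new table starts without triggers, so those still have to be recreated by hand.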

Solution 2: DELETE JOIN

Utilize a self JOIN in a DELETE statement to identify and remove duplicates, preserving only a single copy of each unique row based on specific criteria.

  • Identify the uniqueness criteria and select one row to keep.
  • Perform a self JOIN with a DELETE to remove duplicates.

Example:

-- Delete duplicate rows, keeping the row with the lowest id in each group
DELETE t1
FROM mytable t1
INNER JOIN mytable t2
    ON t1.duplicate_field = t2.duplicate_field
   AND t1.id > t2.id;

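Notes: This method deletes rows in place, so it keeps the table’s existing indexes, triggers, and constraints and needs no second copy of the data. It relies on a unique id column to decide which row survives (here, the one with the lowest id). The choice of uniqueness criteria is crucial: every column that defines a duplicate must appear in the join condition. Because the = operator never matches NULL, rows whose duplicate_field is NULL are left untouched. The self join can be resource-intensive, particularly without an index on duplicate_field, so on tables with a significant number of rows, run it with caution and preferably during off-peak hours to minimize the impact on performance.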
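Since a DELETE is hard to undo, it can help to preview the affected rows and run the statement inside a transaction. The following is a minimal sketch, assuming mytable uses a transactional storage engine such as InnoDB and has the id and duplicate_field columns used in the example above.

```sql
-- Assumes a transactional engine such as InnoDB.

-- 1. Preview the rows that would be removed: every row that has
--    a lower-id twin with the same duplicate_field value
SELECT DISTINCT t1.*
FROM mytable t1
INNER JOIN mytable t2
    ON t1.duplicate_field = t2.duplicate_field
   AND t1.id > t2.id;

-- 2. Run the delete inside a transaction
START TRANSACTION;

DELETE t1
FROM mytable t1
INNER JOIN mytable t2
    ON t1.duplicate_field = t2.duplicate_field
   AND t1.id > t2.id;

-- 3. Verify: list any values that still occur more than once
SELECT duplicate_field, COUNT(*) AS copies
FROM mytable
GROUP BY duplicate_field
HAVING COUNT(*) > 1;

-- 4. Keep the change, or run ROLLBACK instead to undo it
COMMIT;
```

If duplicate_field contains NULLs, the verification query may still report a NULL group, because GROUP BY collects NULLs together while the join condition never matches them.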

Solution 3: GROUP BY and HAVING Clauses

Use GROUP BY and HAVING clauses to identify duplicates, and then perform a delete operation to remove the excess rows.

  • Identify duplicate rows using a GROUP BY clause.
  • Use the results to delete duplicates from the original table.

Example:

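-- First, list each duplicated value with the lowest id in its group (the row to keep)
SELECT duplicate_field, MIN(id) AS min_id
FROM mytable
GROUP BY duplicate_field
HAVING COUNT(*) > 1;

-- Then, delete every row that is not the lowest id for its value.
-- The inner query has no HAVING clause, so rows that were never duplicated are kept.
DELETE FROM mytable
WHERE id NOT IN (
    SELECT min_id
    FROM (SELECT MIN(id) AS min_id FROM mytable GROUP BY duplicate_field) AS tmp
);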

Notes: This solution works well when the table contains relatively few duplicated values; it may become inefficient when duplicates are numerous. On large tables, the subquery that identifies the rows to keep can cause performance issues, since it has to group the whole table. MySQL does not allow a DELETE to select directly from its own target table in a subquery, which is why the grouping query is wrapped in the derived table tmp. Also, be cautious of locking issues if other sessions are using the table during this operation.
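To see which rows survive, the approach can be tried on a small throwaway table first. The sketch below uses a hypothetical table dedup_demo with made-up sample values. The inner query groups every value without a HAVING filter, so the single 'b' row is kept along with the lowest id of each duplicated value.

```sql
-- dedup_demo and its sample rows are hypothetical, for illustration only
CREATE TABLE dedup_demo (
    id INT PRIMARY KEY,
    duplicate_field VARCHAR(20)
);

INSERT INTO dedup_demo (id, duplicate_field) VALUES
    (1, 'a'), (2, 'a'),
    (3, 'b'),
    (4, 'c'), (5, 'c'), (6, 'c');

-- Keep the lowest id for each value; delete everything else
DELETE FROM dedup_demo
WHERE id NOT IN (
    SELECT min_id
    FROM (SELECT MIN(id) AS min_id FROM dedup_demo GROUP BY duplicate_field) AS tmp
);

-- Remaining rows: (1, 'a'), (3, 'b'), (4, 'c')
SELECT id, duplicate_field
FROM dedup_demo
ORDER BY id;

DROP TABLE dedup_demo;
```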

Summary

In conclusion, there are multiple ways to remove duplicate rows in MySQL 8, each with its own set of trade-offs. The choice of method will depend on factors such as table size, database performance impact, and whether the table has related constraints or triggers. Maintaining data integrity by avoiding duplicates is vital, so these methods should be included as part of regular database housekeeping tasks. Testing these methods in a development or staging environment before applying them to a production database is highly recommended.
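Once a table is clean, a unique index can stop the same duplicates from coming back. This is a minimal sketch, assuming duplicate_field alone defines a duplicate; the index name uq_mytable_duplicate_field is hypothetical.

```sql
-- uq_mytable_duplicate_field is a hypothetical index name.
-- This statement fails with a duplicate-entry error (1062) if
-- duplicates still exist, so run it after the cleanup.
ALTER TABLE mytable
    ADD UNIQUE INDEX uq_mytable_duplicate_field (duplicate_field);
```

A unique index in MySQL still permits multiple NULL values, so rows with a NULL duplicate_field are not constrained by it.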