How to delete duplicate rows in PostgreSQL

Updated: January 6, 2024 By: Guest Contributor Post a comment

Introduction

Dealing with duplicate data is a common challenge in database management. In PostgreSQL, we can address this issue through a variety of techniques to ensure data integrity and optimization.

Understanding Duplicate Rows

In PostgreSQL, a row is considered a duplicate of another when all or a specified set of their corresponding columns match. Duplicate rows can be problematic as they may cause misleading results in analyses, increase storage consumption, and more.

Preparation before Deletion

Before proceeding to delete duplicate rows, it’s wise to identify and examine them. The SELECT statement with the DISTINCT keyword helps find unique records, and the COUNT() function can verify the existence of duplicates:

SELECT column1, column2, COUNT(*) FROM your_table GROUP BY column1, column2 HAVING COUNT(*) > 1;

Basic Deletion – Using DISTINCT

To remove duplicates, you can create a new table with distinct records and then swap it with the original:

CREATE TABLE new_table AS
SELECT DISTINCT * FROM your_table;
DROP TABLE your_table;
ALTER TABLE new_table RENAME TO your_table;

Intermediate Deletion – DELETE with Subquery

For tables with a unique key, you can delete duplicates by using a subquery that identifies rows with a lower unique identifier for the same set of values:

DELETE FROM your_table
WHERE id IN (
  SELECT id
  FROM (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
    FROM your_table
  ) t
  WHERE t.rnum > 1
);

Advanced Deletion – Using CTEs

Common Table Expressions (CTEs) allow for more readable and modular code. Here’s how to delete duplicates with a CTE:

WITH duplicates AS (
  SELECT id, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
  FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM duplicates WHERE rnum > 1);

Ensuring Uniqueness Moving Forward

To prevent future duplicates, consider adding a unique constraint to your table, if applicable:

ALTER TABLE your_table ADD CONSTRAINT unique_constraint UNIQUE (column1, column2);

Maintaining Performance

Deleting duplicates can take time, especially with large tables. To minimize disruption, perform deletions during off-peak hours or batch the deletion process.

Conclusion

Managing duplicates efficiently is critical for database health. Through various techniques, PostgreSQL provides us with the means to handle repeated data gracefully. Remember to always take precautions like backups when performing operations that alter large datasets.