Introduction
Dealing with duplicate data is a common challenge in database management. PostgreSQL offers a variety of techniques to remove duplicates, helping you preserve data integrity and keep storage lean.
Understanding Duplicate Rows
In PostgreSQL, a row is considered a duplicate of another when all or a specified set of their corresponding columns match. Duplicate rows can be problematic as they may cause misleading results in analyses, increase storage consumption, and more.
Preparation before Deletion
Before deleting duplicate rows, it's wise to identify and examine them. The SELECT statement with the DISTINCT keyword returns only the unique records, while grouping with GROUP BY and filtering on the COUNT() function reveals which value combinations occur more than once:
SELECT column1, column2, COUNT(*)
FROM your_table
GROUP BY column1, column2
HAVING COUNT(*) > 1;
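As a concrete illustration, here is a small hypothetical table (the table name, columns, and data are invented for demonstration) that you can use to try the detection query:

```sql
-- Hypothetical demonstration table
CREATE TABLE your_table (
    id      serial PRIMARY KEY,
    column1 text,
    column2 text
);

INSERT INTO your_table (column1, column2) VALUES
    ('alice', 'a@example.com'),
    ('alice', 'a@example.com'),  -- exact duplicate of the first row
    ('bob',   'b@example.com');

-- Reports the combination ('alice', 'a@example.com') with a count of 2
SELECT column1, column2, COUNT(*)
FROM your_table
GROUP BY column1, column2
HAVING COUNT(*) > 1;
```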
Basic Deletion – Using DISTINCT
To remove duplicates, you can create a new table containing only the distinct records and then swap it with the original. Note that CREATE TABLE ... AS does not copy indexes, constraints, or privileges, so you will need to recreate those on the new table:
CREATE TABLE new_table AS
SELECT DISTINCT * FROM your_table;
DROP TABLE your_table;
ALTER TABLE new_table RENAME TO your_table;
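Because the original table is briefly dropped mid-swap, it's safer to run all the statements in a single transaction; PostgreSQL's transactional DDL means other sessions never observe the table missing. A sketch:

```sql
BEGIN;

-- Build a de-duplicated copy of the table
CREATE TABLE new_table AS
SELECT DISTINCT * FROM your_table;

-- Swap the copy in; both steps are rolled back together on failure
DROP TABLE your_table;
ALTER TABLE new_table RENAME TO your_table;

-- Recreate indexes, constraints, and grants here before committing,
-- since CREATE TABLE ... AS does not carry them over

COMMIT;
```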
Intermediate Deletion – DELETE with Subquery
For tables with a unique key, you can delete duplicates with a subquery that, for each set of matching values, keeps the row with the lowest id and targets the rest for deletion:
DELETE FROM your_table
WHERE id IN (
    SELECT id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
        FROM your_table
    ) t
    WHERE t.rnum > 1
);
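If the table has no unique key at all, PostgreSQL's system column ctid (the physical location of a row version) can stand in for one. A common sketch of that variant:

```sql
-- ctid uniquely identifies each physical row, so it can substitute
-- for a missing primary key when de-duplicating
DELETE FROM your_table
WHERE ctid IN (
    SELECT ctid
    FROM (
        SELECT ctid,
               ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY ctid) AS rnum
        FROM your_table
    ) t
    WHERE t.rnum > 1
);
```

Note that ctid values change when rows are updated or the table is rewritten (e.g. by VACUUM FULL), so they should only be used within a single statement like this, never stored.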
Advanced Deletion – Using CTEs
Common Table Expressions (CTEs) allow for more readable and modular code. Here’s how to delete duplicates with a CTE:
WITH duplicates AS (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
    FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM duplicates WHERE rnum > 1);
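As a small refinement, a RETURNING clause lets you see, or log, exactly which rows were removed:

```sql
WITH duplicates AS (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
    FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM duplicates WHERE rnum > 1)
RETURNING id, column1, column2;  -- each deleted row is returned to the client
```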
Ensuring Uniqueness Moving Forward
To prevent future duplicates, consider adding a unique constraint to your table, if applicable:
ALTER TABLE your_table ADD CONSTRAINT unique_constraint UNIQUE (column1, column2);
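Once the constraint is in place, inserts that would violate it can be skipped gracefully instead of raising an error, using PostgreSQL's ON CONFLICT clause (the values below are illustrative):

```sql
-- With the unique constraint on (column1, column2) in place,
-- a would-be duplicate insert is silently ignored
INSERT INTO your_table (column1, column2)
VALUES ('alice', 'a@example.com')
ON CONFLICT (column1, column2) DO NOTHING;
```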
Maintaining Performance
Deleting duplicates can take time, especially with large tables, and a single huge DELETE holds locks and generates many dead rows at once. To minimize disruption, perform deletions during off-peak hours or batch the deletion process, and let autovacuum (or an explicit VACUUM) reclaim the space afterward.
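One way to batch the work server-side is a PL/pgSQL DO block that deletes a bounded number of duplicates per iteration; this sketch reuses the hypothetical your_table and a batch size of 10,000 chosen for illustration:

```sql
DO $$
DECLARE
    deleted integer;
BEGIN
    LOOP
        -- Delete at most 10,000 duplicate rows per pass
        DELETE FROM your_table
        WHERE id IN (
            SELECT id
            FROM (
                SELECT id, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
                FROM your_table
            ) t
            WHERE t.rnum > 1
            LIMIT 10000
        );

        GET DIAGNOSTICS deleted = ROW_COUNT;
        EXIT WHEN deleted = 0;  -- stop once no duplicates remain

        COMMIT;  -- requires PostgreSQL 11+; releases locks between batches
    END LOOP;
END $$;
```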
Conclusion
Managing duplicates efficiently is critical for database health. Through various techniques, PostgreSQL provides us with the means to handle repeated data gracefully. Remember to always take precautions like backups when performing operations that alter large datasets.