Sling Academy

How to delete duplicate rows in PostgreSQL

Last updated: January 06, 2024

Introduction

Dealing with duplicate data is a common challenge in database management. In PostgreSQL, we can address this issue through a variety of techniques to ensure data integrity and optimization.

Understanding Duplicate Rows

In PostgreSQL, a row is considered a duplicate of another when all or a specified set of their corresponding columns match. Duplicate rows can be problematic as they may cause misleading results in analyses, increase storage consumption, and more.
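As a running example (the table and column names here are illustrative), suppose we have a table where two rows share the same values in every column except the surrogate key:

```sql
-- Hypothetical sample table used throughout this article
CREATE TABLE your_table (
  id      SERIAL PRIMARY KEY,
  column1 TEXT,
  column2 TEXT
);

INSERT INTO your_table (column1, column2) VALUES
  ('alice', 'admin'),
  ('alice', 'admin'),  -- duplicate of the first row (ignoring id)
  ('bob',   'editor');
```

The first two rows are duplicates with respect to (column1, column2), even though their id values differ.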

Preparation before Deletion

Before proceeding to delete duplicate rows, it’s wise to identify and examine them. Grouping rows with GROUP BY and filtering with HAVING COUNT(*) > 1 reveals which combinations of values occur more than once:

SELECT column1, column2, COUNT(*)
FROM your_table
GROUP BY column1, column2
HAVING COUNT(*) > 1;

Basic Deletion – Using DISTINCT

To remove duplicates, you can create a new table containing only the distinct records and then swap it with the original. Be aware that CREATE TABLE ... AS copies only the data: any indexes, constraints, triggers, and privileges on the original table must be recreated on the new one afterwards.

CREATE TABLE new_table AS
SELECT DISTINCT * FROM your_table;
DROP TABLE your_table;
ALTER TABLE new_table RENAME TO your_table;
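Because this approach drops and rebuilds the table, it is safer to run all the statements in a single transaction so that readers never observe the table missing. A sketch:

```sql
BEGIN;

CREATE TABLE new_table AS
SELECT DISTINCT * FROM your_table;

DROP TABLE your_table;
ALTER TABLE new_table RENAME TO your_table;

-- CREATE TABLE ... AS copies only the data, so recreate indexes,
-- constraints, defaults, and privileges here, for example:
-- ALTER TABLE your_table ADD PRIMARY KEY (id);

COMMIT;
```

The DROP TABLE takes an exclusive lock, so plan this for a maintenance window on busy tables.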

Intermediate Deletion – DELETE with Subquery

For tables with a unique key (such as an id column), you can delete duplicates with a subquery that numbers the rows in each group of identical values. The row with the lowest id in each group gets row number 1 and is kept; every row numbered higher is deleted:

DELETE FROM your_table
WHERE id IN (
  SELECT id
  FROM (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
    FROM your_table
  ) t
  WHERE t.rnum > 1
);
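If the table has no unique key at all, PostgreSQL’s system column ctid (the physical location of a row) can stand in for one. The self-join below keeps one physical row per duplicate group and deletes the rest:

```sql
-- Delete duplicates using ctid when no unique id column exists.
-- For each group of rows with equal (column1, column2), only the row
-- with the highest ctid survives; all others are deleted.
DELETE FROM your_table a
USING your_table b
WHERE a.ctid < b.ctid
  AND a.column1 = b.column1
  AND a.column2 = b.column2;
```

Note that ctid values are not stable across VACUUM FULL or row updates, so they should only ever be compared within a single statement like this, never stored.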

Advanced Deletion – Using CTEs

Common Table Expressions (CTEs) allow for more readable and modular code. Here’s how to delete duplicates with a CTE:

WITH duplicates AS (
  SELECT id, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
  FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM duplicates WHERE rnum > 1);

Ensuring Uniqueness Moving Forward

To prevent future duplicates, consider adding a unique constraint to your table, if applicable. Note that the statement will fail if duplicates still exist, so remove them first:

ALTER TABLE your_table ADD CONSTRAINT unique_constraint UNIQUE (column1, column2);
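With the unique constraint in place, inserts that would create a duplicate can be skipped (or turned into updates) using the ON CONFLICT clause instead of raising an error:

```sql
-- Silently skip an insert that would violate the unique constraint
INSERT INTO your_table (column1, column2)
VALUES ('alice', 'admin')
ON CONFLICT (column1, column2) DO NOTHING;
```

Use ON CONFLICT ... DO UPDATE instead if you want the new values to overwrite the existing row.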

Maintaining Performance

Deleting duplicates can take time, especially with large tables. To minimize disruption, perform deletions during off-peak hours or batch the deletion process.
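One way to batch the deletion (a sketch, assuming the id-based approach shown earlier) is to cap each DELETE with a LIMIT in the subquery and rerun the statement until it removes no more rows:

```sql
-- Delete duplicates in batches of at most 10,000 rows.
-- Rerun this statement until it reports DELETE 0.
DELETE FROM your_table
WHERE id IN (
  SELECT id
  FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
    FROM your_table
  ) t
  WHERE t.rnum > 1
  LIMIT 10000
);
```

Each run commits a small transaction, which keeps lock durations short and gives autovacuum a chance to keep up between batches.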

Conclusion

Managing duplicates efficiently is critical for database health. Through various techniques, PostgreSQL provides us with the means to handle repeated data gracefully. Remember to always take precautions like backups when performing operations that alter large datasets.
