PostgreSQL: Selecting the first row in each GROUP BY group

Updated: January 6, 2024 By: Guest Contributor

Introduction

Data analysis often requires picking one representative record from each group of rows, for example the earliest entry per category. PostgreSQL offers several ways to do this. This guide shows how to select the first row within each GROUP BY group using DISTINCT ON, window functions, aggregate functions with a JOIN, and correlated subqueries.

Using DISTINCT ON

The DISTINCT ON clause is a PostgreSQL-specific extension that allows you to return the first row for each set of distinct column values.

SELECT DISTINCT ON (column_to_group) *
FROM your_table
ORDER BY column_to_group, criteria_column;

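The ORDER BY clause must begin with the same expressions as DISTINCT ON, followed by the criteria that decide which row counts as first. PostgreSQL raises an error if the leading ORDER BY expressions do not match the DISTINCT ON expressions, and if ORDER BY is omitted entirely, the row returned for each group is unpredictable.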

Example

SELECT DISTINCT ON (department) *
FROM employees
ORDER BY department, join_date;

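This returns one row per department: the employee with the earliest join_date. If several employees in a department share that date, PostgreSQL returns one of them arbitrarily.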
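One way to make the result repeatable when join dates tie is to extend the ORDER BY with a unique column. The following is a minimal sketch, assuming employees has a unique id column (the UPDATE example later in this guide relies on the same column):

```sql
-- Assumes employees has a unique id column; it serves only as a tie-breaker here.
SELECT DISTINCT ON (department) *
FROM employees
ORDER BY department,   -- must come first: matches the DISTINCT ON expression
         join_date,    -- earliest join date wins
         id;           -- among equal join dates, the lowest id wins
```

To pick the most recently joined employee instead, change join_date to join_date DESC. The department column still has to lead the ORDER BY.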

Window Functions

PostgreSQL’s window functions allow us to perform calculations across rows related to our current row. To pick the first row per group, we use ROW_NUMBER().

WITH RankedEmployees AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY department ORDER BY join_date) AS rn
    FROM employees
)
SELECT *
FROM RankedEmployees
WHERE rn = 1;

The CTE numbers the rows within each department in join_date order, starting at 1, and the outer query keeps only the rows numbered 1. Rows with equal join dates are numbered in an arbitrary order.
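Because the outer query uses SELECT *, its result also contains the helper column rn. The variant below is a sketch that lists the output columns explicitly and adds id as a tie-breaker. It assumes employees has the columns id, department, join_date, and status, which are the ones used elsewhere in this guide; adjust the list to your actual schema.

```sql
WITH ranked_employees AS (
    SELECT id, department, join_date, status,
           ROW_NUMBER() OVER (
               PARTITION BY department
               ORDER BY join_date, id   -- id breaks ties between equal join dates
           ) AS rn
    FROM employees
)
SELECT id, department, join_date, status   -- rn is left out of the final result
FROM ranked_employees
WHERE rn = 1;
```

Changing the filter to rn <= 3 would return the first three employees per department, which DISTINCT ON cannot express directly.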

Aggregate Functions

Aggregate functions can be combined with a JOIN to select the first row per group.

SELECT a.*
FROM employees a
JOIN (
    SELECT department, MIN(join_date) AS min_join_date
    FROM employees
    GROUP BY department
) b ON a.department = b.department AND a.join_date = b.min_join_date;

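We first find the earliest join date for each department, then join that result back to the employees table to retrieve the full row. Unlike DISTINCT ON and ROW_NUMBER(), this returns every employee who shares the earliest date, so a department can yield more than one row.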
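To see how this approach treats ties, here is a small self-contained sketch. The employees_demo table and its rows are made up for illustration and are not part of the schema used in the rest of this guide.

```sql
-- Hypothetical demo table, not part of the original schema.
CREATE TEMP TABLE employees_demo (
    id         integer PRIMARY KEY,
    department text NOT NULL,
    join_date  date NOT NULL
);

INSERT INTO employees_demo (id, department, join_date) VALUES
    (1, 'Sales', '2020-01-15'),
    (2, 'Sales', '2020-01-15'),   -- same earliest date as id 1
    (3, 'Sales', '2021-06-01'),
    (4, 'IT',    '2019-03-10');

-- Same join-back pattern as above, applied to the demo table.
SELECT a.*
FROM employees_demo a
JOIN (
    SELECT department, MIN(join_date) AS min_join_date
    FROM employees_demo
    GROUP BY department
) b ON a.department = b.department
   AND a.join_date = b.min_join_date
ORDER BY a.id;
```

With these rows the query returns ids 1, 2, and 4: both Sales employees who joined on 2020-01-15 are kept.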

Subqueries

Subqueries can also be used to isolate the first row per group, especially when the condition for ‘first’ is complex.

SELECT *
FROM employees a
WHERE a.join_date = (
    SELECT MIN(b.join_date)
    FROM employees b
    WHERE a.department = b.department
);

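Here, for each row of employees, the correlated subquery finds the earliest join date within the same department, and the outer query keeps the row only if its own join_date matches. Like the join version above, it returns every row that shares the earliest date.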
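If exactly one row per department is required, the subquery can return a row identifier instead of a date. The following is a sketch of that idea, assuming a unique id column. The ORDER BY with LIMIT 1 inside the subquery defines what counts as first and can hold as many criteria as needed.

```sql
SELECT *
FROM employees a
WHERE a.id = (
    SELECT b.id
    FROM employees b
    WHERE b.department = a.department
    ORDER BY b.join_date, b.id   -- earliest join date, then lowest id
    LIMIT 1
);
```

Rows whose department is NULL are not returned, because the equality test in the subquery never matches them.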

Advanced Use Cases

These techniques also work inside larger statements: in JOINs, in subselects, and in the WHERE clause of an UPDATE. The following statement uses ROW_NUMBER() to choose which rows to update.

UPDATE employees SET status = 'Oldest in Department'
WHERE id IN (
    SELECT id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY department ORDER BY join_date) AS rn
        FROM employees
    ) subquery
    WHERE rn = 1
);

This statement marks the longest-serving employee in each department (the one with the earliest join_date) with a special status. It assumes that id uniquely identifies a row in employees.
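The same update can also be written with DISTINCT ON in the subquery. This sketch assumes the id and status columns used above and adds RETURNING, a PostgreSQL clause that reports the rows the statement changed. It is written as a dry run: the transaction ends with ROLLBACK, so nothing is kept unless you replace that with COMMIT.

```sql
BEGIN;

UPDATE employees
SET status = 'Oldest in Department'
WHERE id IN (
    SELECT DISTINCT ON (department) id
    FROM employees
    ORDER BY department, join_date, id   -- id breaks ties between equal join dates
)
RETURNING id, department, join_date, status;

-- Inspect the returned rows, then use COMMIT instead of ROLLBACK
-- if the change should be kept.
ROLLBACK;
```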

Performance Considerations

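Which method is fastest depends on the size of the table, the number of groups, and the available indexes. An index whose leading columns match the grouping and ordering columns, such as (department, join_date), can let PostgreSQL read the rows in the required order and avoid a separate sort step. Run each candidate query through EXPLAIN (or EXPLAIN ANALYZE, which also executes the query) and compare the plans before choosing a method.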
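As a sketch of what that can look like in practice, the statements below create a composite index and inspect the plan for the DISTINCT ON query. The index name is hypothetical, and whether the planner actually uses the index depends on table size and statistics.

```sql
-- Hypothetical index name; column order mirrors ORDER BY department, join_date.
CREATE INDEX IF NOT EXISTS employees_department_join_date_idx
    ON employees (department, join_date);

-- Refresh planner statistics so the estimates reflect the current data.
ANALYZE employees;

-- EXPLAIN ANALYZE runs the query and reports actual row counts and timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (department) *
FROM employees
ORDER BY department, join_date;
```

In the output, an Index Scan on the new index with no separate Sort node suggests that the index is supplying the ordering. A Seq Scan followed by a Sort means the planner chose to sort the table instead.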

Conclusion

There are several ways to select the first row in each GROUP BY group in PostgreSQL. DISTINCT ON is the most concise, window functions also handle the first N rows per group, and aggregate joins or correlated subqueries keep every row that ties for first place. Which one to use depends on how ties should be handled and on how each query performs against your data.