2 Ways to Select N Random Rows in PostgreSQL

Updated: January 4, 2024 By: Guest Contributor Post a comment

Introduction

Selecting random rows from a database can be a common requirement for tasks such as sampling, testing, or balancing load on datasets. When using PostgreSQL, there are several methods available that can accomplish this task. Each method has its own advantages and disadvantages that might make it more suitable for specific scenarios. In this tutorial, we will explore different ways to select N random rows in PostgreSQL.

Method 1: ORDER BY RANDOM()

The most straightforward method to select random rows in PostgreSQL is to use the ORDER BY RANDOM() clause, which orders the entire table at random and then limits the query to the first N rows. This method is simple to write and understand.

Steps to Implement:

  1. Use the SELECT statement to define the columns you want to retrieve.
  2. Add an ORDER BY RANDOM() clause to randomize the rows.
  3. Limit the result set to N rows using the LIMIT clause.

Code Example:

SELECT *
FROM your_table
ORDER BY RANDOM()
LIMIT N;

Advantages:

  • Very easy to implement.
  • No need for extra setup or indexing on the table.

Limitations:

  • Not efficient for large tables as it sorts the entire table.
  • Can be slow because it generates a random value for each row before ordering.

Method 2: TABLESAMPLE SYSTEM

TABLESAMPLE is a clause in PostgreSQL that allows you to select a random sample of data from a table. The SYSTEM sampling method is typically faster than ORDER BY RANDOM(), yet it provides less randomness. It’s more suitable for larger tables where performance is a concern.

Steps to Implement:

  1. Use the SELECT statement along with the TABLESAMPLE SYSTEM clause.
  2. Provide the percentage of rows to sample based on the desired number of random rows and table size.
  3. Enclose the above within another SELECT statement if an exact number of rows is needed.

Code Example:

SELECT *
FROM
  (SELECT * FROM your_table TABLESAMPLE SYSTEM (percentage)) as subquery
LIMIT N;

Advantages:

  • More performant than ORDER BY RANDOM() for large tables.

Limitations:

  • Lesser degree of randomness as this method is block-based.
  • Figuring out the correct percentage for the desired number of rows can be tricky.

Conclusion

PostgreSQL provides multiple ways to select a set of random rows, with varying balance between ease of implementation and system performance. The basic ORDER BY RANDOM() approach is best suited for smaller datasets where query performance is not a major concern. For larger datasets, considering alternatives such as the TABLESAMPLE method is advised, keeping in mind the tradeoff between randomness and performance. It is important to test these solutions on your data and query planner to ensure they meet both your performance and randomness needs before integrating them into your production code.