How to efficiently select N random rows in MySQL 8

Updated: January 26, 2024 By: Guest Contributor

Introduction

Random row selection in databases can be essential for various applications such as data sampling, A/B testing, or simply when you want to give your users a random set of items. However, fetching random rows can be expensive, especially on large datasets. In this tutorial, we’ll explore multiple methods to select N random rows in MySQL 8 efficiently.

Basic Method: ORDER BY RAND()

The simplest way to retrieve random rows in MySQL is by using the ORDER BY RAND() clause. Here is the basic syntax to retrieve N random rows using this method:

SELECT * FROM table_name
ORDER BY RAND()
LIMIT N;

For example, to fetch 5 random rows from a table named users:

SELECT * FROM users
ORDER BY RAND()
LIMIT 5;

However, this method is not efficient for large tables: MySQL must read every row, generate a random value for each one, and sort the entire result before it can return the first N rows.
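One way to see this cost is to inspect the execution plan. The following is a minimal sketch using the users table from the example above; on a typical setup the plan shows a full scan (type ALL) with "Using temporary; Using filesort" in the Extra column, though the exact output depends on the MySQL version and the table.

```sql
-- Show how MySQL plans to execute the random-order query
EXPLAIN
SELECT * FROM users
ORDER BY RAND()
LIMIT 5;
```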

Using a Derived Table with a LIMIT Clause

A slightly more efficient method involves using a derived table with a LIMIT clause and then ordering the results randomly:

SELECT t.* FROM (
    SELECT * FROM table_name
    LIMIT 1000
) AS t
ORDER BY RAND()
LIMIT N;

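This method reduces the number of rows that have to be sorted by RAND(), so it runs faster on large tables. However, the inner LIMIT 1000 has no ORDER BY, so it returns whichever 1000 rows MySQL reads first, usually the same ones on every run. The final result is random only within that subset, not across the entire table.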

Random Sampling with Primary Key Ranges

For tables with an auto-increment primary key, a more efficient method is to generate random numbers within the ID range:

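SET @min = (SELECT MIN(id) FROM table_name);
SET @max = (SELECT MAX(id) FROM table_name);
-- Pick the starting ID once. RAND() written directly in the WHERE clause
-- would be re-evaluated for every row and force a full scan.
SET @start = FLOOR(@min + (@max - @min + 1) * RAND());

SELECT *
FROM table_name
WHERE id >= @start
ORDER BY id
LIMIT N;

This technique avoids the full table scan because the primary key index can seek directly to the starting ID. It has two limitations: the N rows are consecutive rather than independently chosen, and it assumes rows are evenly distributed across the ID range. If there are gaps, a row that follows a gap is more likely to be picked as the starting row, and a starting point near the end of the range can return fewer than N rows.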
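Before relying on this method, it can help to check how densely the ID range is populated. The query below is a minimal sketch; the alias id_density is just an illustrative name. A value near 1 means few gaps, while a much lower value suggests many deleted or skipped IDs. It does not show whether the gaps are clustered in one part of the range.

```sql
-- Ratio of existing rows to the size of the ID range (1.0 = no gaps)
SELECT COUNT(*) / (MAX(id) - MIN(id) + 1) AS id_density
FROM table_name;
```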

Join-based Approach for Non-Sequential IDs

In scenarios where the ID distribution is sparse or non-sequential, a more robust solution uses a random join:

SELECT table_name.*
FROM (
    SELECT id
    FROM table_name
    WHERE RAND() < (SELECT ((N / COUNT(*)) * 10)
                     FROM table_name)
    ORDER BY RAND()
    LIMIT N
) AS random_ids
JOIN table_name ON table_name.id = random_ids.id;

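This code first calculates the fraction of the table to sample (N / COUNT(*)), multiplies it by a constant factor (here 10) so that the filter very likely keeps at least N rows, and then filters IDs against this threshold. Only the surviving IDs, roughly 10 × N of them, are sorted randomly and cut down to N, and that set is then joined back to the original table to fetch the full rows. The inner query still visits every row, but it selects only the id column and sorts far fewer rows than ORDER BY RAND() applied to the whole table.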
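As a concrete illustration, here is the same pattern with N = 5 on the users table from the earlier example. The row count in the comments is an assumed figure chosen only to make the arithmetic visible.

```sql
-- Assuming users holds about 1,000,000 rows:
--   threshold = (5 / 1000000) * 10 = 0.00005
--   expected candidates after the filter = about 50 IDs
SELECT u.*
FROM (
    SELECT id
    FROM users
    WHERE RAND() < (SELECT (5 / COUNT(*)) * 10
                    FROM users)
    ORDER BY RAND()   -- shuffle only the candidates
    LIMIT 5           -- keep N = 5 of them
) AS random_ids
JOIN users AS u ON u.id = random_ids.id;
```

On a very small table the filter can occasionally keep fewer than 5 IDs, in which case the query returns fewer rows; raising the factor of 10 makes that less likely.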

Optimized Large Dataset Sampling: Approximate Percentage Sampling

Some database systems, such as PostgreSQL and SQL Server, provide a TABLESAMPLE clause for sampling an approximate percentage of rows. MySQL 8 does not support this clause. A comparable approximate sample can be taken by filtering each row on RAND():

SELECT *
FROM table_name
WHERE RAND() < 0.01;  -- keep each row with a probability of 1%

Each row is kept independently with the given probability, so the query still reads every row but does not need to sort them. To fetch a specific number of rows, the percentage can be approximated based on the total row count. However, this approach provides an estimated number of rows, not an exact count.
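To aim for roughly a given number of rows, the sampling fraction can be derived from the row count. The following is a minimal sketch that filters on RAND(); the variable names @target and @total are illustrative, and the number of rows returned will vary from run to run around the target.

```sql
SET @target = 100;                                -- desired number of rows (approximate)
SET @total  = (SELECT COUNT(*) FROM table_name);  -- total rows in the table

SELECT *
FROM table_name
WHERE RAND() < @target / @total;                  -- keep each row with probability target/total
```

If an exact count is required, one option is to oversample slightly and add ORDER BY RAND() with a LIMIT on the much smaller result.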

Combining Multiple Techniques

For even better efficiency, particularly for very large datasets, you can select N random rows from a table by first narrowing down the dataset and then applying random ordering to this subset:

SELECT * FROM (
    SELECT * FROM your_table
    WHERE some_condition  -- Narrow down your dataset based on a condition
    ORDER BY RAND()       -- Randomize the order
    LIMIT 1000            -- Limit to a manageable subset
) AS subset
ORDER BY RAND()           -- Randomize the order again
LIMIT 10;                 -- Finally, select N random rows (e.g., 10)


Inner Query:

  • SELECT * FROM your_table WHERE some_condition: This part of the query is where you narrow down your data. Replace some_condition with a condition suitable for your dataset. This step is crucial for performance, especially with large tables.
  • ORDER BY RAND(): Randomizes the order of the rows.
  • LIMIT 1000: Limits the result set to a smaller subset (in this example, 1000 rows). Adjust this number based on your dataset size and needs.

Outer Query:

  • The outer query shuffles the inner query’s subset again (ORDER BY RAND()). This second sort is cheap because it touches at most 1000 rows.
  • LIMIT 10: Finally, this limits the result to N rows, in this case, 10. Adjust this to how many random rows you need.
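Because the inner query already returns a uniformly random subset of the filtered rows, a single-level query draws from the same distribution. The sketch below assumes the users table has a created_at column, which is a hypothetical name used only for illustration; the speed-up comes from the WHERE condition (ideally backed by an index), not from the second shuffle.

```sql
SELECT *
FROM users
WHERE created_at >= '2024-01-01'  -- hypothetical filter column; narrows the rows first
ORDER BY RAND()                   -- shuffle only the rows that pass the filter
LIMIT 10;                         -- N = 10
```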

Conclusion

In conclusion, while the ORDER BY RAND() method is easy to use, it is not suited for large tables because it has to read and sort every row. Depending on the table structure and the size of the dataset, more efficient methods such as random primary key ranges, join-based approaches, and approximate percentage sampling should be considered. Note that MySQL 8 has no TABLESAMPLE clause, although other database systems provide one.