PostgreSQL – Get a truly RANDOM row from a PostgreSQL table quickly

Tags: postgresql, postgresql-performance, random

I always used to do:

SELECT column FROM table ORDER BY random() LIMIT 1;

For large tables, this was unbearably, impossibly slow, to the point of being useless in practice. That's why I started hunting for more efficient methods. People recommended:

SELECT column FROM table TABLESAMPLE BERNOULLI(1) LIMIT 1;

While fast, it provides worthless randomness: it appears to always pick the same damn records.

I've also tried:

SELECT column FROM table TABLESAMPLE BERNOULLI(100) LIMIT 1;

It gives even worse randomness: it picks the same few records every time. This is completely worthless - I need actual randomness.

Why is it apparently so difficult to just pick a random record? Why does the first approach have to grab EVERY record and then sort them? And why do the TABLESAMPLE versions just grab the same stupid records all the time? Why aren't they random whatsoever? Who would ever want to use this BERNOULLI stuff when it just picks the same few records over and over? I can't believe that, after all these years, I'm still asking how to grab a random record - it's one of the most basic possible queries.

What is the actual command to use for grabbing a random record from a table in PG which isn't so slow that it takes several full seconds for a decent-sized table?

Best Answer

Interesting question - one with many possibilities and permutations (this answer has been extensively revised).

Basically, this problem can be divided into two main streams.

  • A single random record

  • Multiple random records (not in the question - see reference and discussion at bottom)

Having researched this, I believe that the fastest solution to the single-record problem is the tsm_system_rows extension to PostgreSQL, proposed in Evan Carroll's answer.

I'm not sure about every binary distribution, but I believe that the contrib modules (of which tsm_system_rows is one) are available by default - at least they were in the EnterpriseDB Windows version I used for my Windows testing (see below). My main testing was done on PostgreSQL 12.1 compiled from source on Linux (make world and make install-world).

The reason I feel that it is best for the single-record use case is that the only caveat mentioned concerning this extension is that:

Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.

However, since you are only interested in selecting one row, the block-level clustering effect should not be an issue. This article from 2ndQuadrant shows why this shouldn't be a problem for a sample of one record. It is a major problem for small subsets (see end of post), or if you wish to generate a large sample of random records from one large table (again, see the discussion of tsm_system_rows and tsm_system_time below).
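One setup note: tsm_system_rows is a contrib module, so it may first need to be enabled in the target database (this is the standard extension mechanism, nothing specific to my setup):

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;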

Then I created and populated a table like this:

CREATE TABLE rand AS SELECT generate_series(1, 100000000) AS seq, MD5(random()::text);

So, I now have a table with 100,000,000 (100 million) records. Then I added a PRIMARY KEY:

ALTER TABLE rand ADD PRIMARY KEY (seq);

So, now to SELECT random records:

SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);

Notice that I have used a slightly modified query so that I could "see" the randomness - I also turned on psql's \timing so that I could get empirical measurements.

I used the LENGTH() function so that I could readily perceive the size of the PRIMARY KEY integer being returned. Here is a sample of records returned:

test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 970749.61 | bf18719016ff4f5d16ed54c5f4679e20
(1 row)

Time: 30.606 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 512101.21 | d27fbeea30b79d3e4eacdfea7a62b8ac
(1 row)

Time: 0.556 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 666476.41 | c7c0c34d59229bdc42d91d0d4d9d1403
(1 row)

Time: 0.650 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column? |               md5                
--------+----------+----------------------------------
      5 | 49152.01 | 0a2ff4da00a2b81697e7e465bd67d85c
(1 row)

Time: 0.593 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column? |               md5                
--------+----------+----------------------------------
      5 | 18061.21 | ee46adc96a6f8264a5c6614f8463667d
(1 row)

Time: 0.616 ms
test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 691962.01 | 4bac0d051490c47716f860f8afb8b24a
(1 row)

Time: 0.743 ms

So, as you can see, the LENGTH() function returns 6 most of the time - this is to be expected, as most records will have seq between 10,000,000 and 100,000,000. There are a couple which show a value of 5 (I have also seen values of 3 and 4 - data not shown).

Now, notice the timings. The first is 30 milliseconds (ms), but the rest are sub-millisecond (approx. 0.6 - 0.7 ms). Most of the random samples are returned in this sub-millisecond range, but results in the 25 - 30 ms range do occur (1 in 3 or 4 on average).

From time to time, this multi-millisecond result can occur twice or even three times in a row, but, as I said, the majority of results (approx. 66 - 75%) are sub-millisecond. None of the response times I have seen for my solution has been in excess of 75 ms.

During my research I also discovered the tsm_system_time extension, which is similar to tsm_system_rows. I benchmarked it as follows:

SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_TIME(0.001) LIMIT 1;

Note that the time quantum of 0.001 is 1/1000th of a millisecond, i.e. a microsecond - if any number lower than this is entered, no records are returned. Interestingly, however, even this tiny quantum always returns 120 rows (of which the LIMIT keeps one).
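You can verify that count for yourself by dropping the LIMIT and counting the sampled rows:

SELECT COUNT(*) FROM rand TABLESAMPLE SYSTEM_TIME(0.001);  -- returns 120 on this table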

Quite why it's 120 puzzled me at first. The PostgreSQL page size is 8192 (the default)

test=# SELECT current_setting('block_size');
 current_setting 
-----------------
 8192
(1 row)

and the file system block size is 4096

[pol@UNKNOWN inst]$blockdev --getbsz /dev/mapper/fedora_localhost--live-home 
4096

A record here is 1 INTEGER (4 bytes) plus an MD5 hash stored as TEXT (32 bytes, plus a 1-byte varlena header). Adding the 23-byte heap tuple header (padded to 24) and the 4-byte line pointer gives roughly 65 - 68 bytes per row, and 8192 / 68 ≈ 120. So the 120 appears simply to be the number of rows that fit in one 8 KB page: the sampling is block-level, and even the smallest time quantum reads (and returns) at least one whole block.
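As a rough sanity check of the rows-per-page figure (rough, because reltuples and relpages are planner estimates, refreshed by ANALYZE/VACUUM), you can read it straight out of pg_class:

SELECT relpages,
       reltuples::BIGINT AS row_estimate,
       (reltuples / relpages)::NUMERIC(10, 1) AS rows_per_page
FROM pg_class
WHERE relname = 'rand';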

I'm not quite sure if the LIMIT clause will always return the first tuple of the page or block - thereby introducing an element of non-randomness into the equation.

The performance of the tsm_system_time query is identical (AFAICS - data not shown) to that of the tsm_system_rows extension. The same caveat about not being sure whether there is an element of non-randomness introduced by how these extensions choose their first record also applies to the tsm_system_rows queries. See discussion and bench-testing of the (so-called) randomness of these two methods below.

With respect to performance, just for reference: I'm using a Dell Studio 1557 with a 1TB HDD (spinning rust) and 8GB of DDR3 RAM, running Fedora 31. This is a 10-year-old machine!

I also did the same thing on a machine with an SSD (a Packard Bell EasyNote TM - also 10 years old, 8GB DDR3 RAM, running Windows Server 2019; the SSD is by no means top of the range). The response times are typically (strangely enough) a bit higher (~ 1.3 ms), but there are fewer spikes and their values are lower (~ 5 - 7 ms).

There could well be a lot of stuff running in the background with 2019 Server - but if you have a modern laptop with a decent SSD, there's no reason that you can't expect sub-millisecond response times as a matter of course!

All tests were run using PostgreSQL 12.1.

To check out the true "randomness" of both methods, I created the following table:

CREATE TABLE rand_samp 
(
  seq INT, 
  md5 TEXT
);

and then ran (3 times each):

DO
$$
BEGIN
  -- i is implicitly declared as an INTEGER by the integer FOR loop
  FOR i IN 1..10000 LOOP
    INSERT INTO rand_samp (seq, md5)
    SELECT seq, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
  END LOOP;
END;
$$;

and then again, substituting the following in the inner loop of the above block:

SELECT seq, md5 FROM rand TABLESAMPLE SYSTEM_TIME(0.001) LIMIT 1;

Then after each run, I queried my rand_samp table:

SELECT 
  seq, COUNT(seq) 
FROM rand_samp 
GROUP BY seq 
HAVING COUNT(seq) > 1;

And got the following counts:

For TABLESAMPLE SYSTEM_ROWS, I got 258, 63, 44 dupes, all with a count of 2. For TABLESAMPLE SYSTEM_TIME, I got 46, 54 and 62, again all with a count of 2.

Now, my stats are a bit rusty, but with a sample of 10,000 drawn from a table of 100M records (one ten-thousandth of the table), I'd expect only the occasional duplicate - nothing like the numbers I obtained. Furthermore, if the sampling were truly random, I'd also expect (a small number of) counts of 3 and 4.
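To put a number on "the occasional duplicate": by the standard birthday-problem approximation, the expected number of colliding pairs when drawing n rows (with replacement) from N is n(n-1)/2N, which here works out to about 0.5:

SELECT (10000.0 * 9999.0) / (2 * 100000000) AS expected_colliding_pairs;
-- ≈ 0.5, i.e. roughly one duplicate every other run of 10,000 samples

So 44 - 258 duplicates is orders of magnitude more than chance alone would explain.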

I ran two tests with 100,000 runs using TABLESAMPLE SYSTEM_ROWS and obtained 5540 dupes (~ 200 with a count of 3 and 6 with a count of 4) on the first run, and 5465 dupes on the second (~ 200 with 3 and 6 with 4). The most interesting query, however, was this:

SELECT COUNT(s.seq)
FROM rand_samp s
WHERE s.seq IN (SELECT sb.seq FROM rand_samp_bis sb);

where I compare all the samples from the two runs of 100,000 against each other (the second run's samples having been stored in an identically-structured table, rand_samp_bis). The answer is that a whopping 11,250 (> 10%) are the same - which, for two samples of one thousandth (1/1000) of the table each, is WAY too much to be down to chance: pure chance would predict an overlap of only about 100,000 × (100,000 / 100,000,000) ≈ 100 records.

The results for 100,000 runs with SYSTEM_TIME: 5467 dupes, 215 with a count of 3 and 9 with a count of 4, on the first run; 5472 dupes, 210 with 3 and 12 with 4, on the second. The number of matching records is 11,328 (again > 10%).

There's clearly (a LOT of) non-random behaviour going on. I'll leave it to the OP to decide whether the speed/randomness trade-off is worth it!

Benchmark of other answers.

I decided to benchmark the other proposed solutions, using my 100-million-record table from above. I ran all tests 5 times, ignoring any outliers at the beginning of a series of tests so as to eliminate cache-warming effects. All the outlier values were higher than those reported below.

I'm using the machine with the HDD - I'll test with the SSD machine later. The .mmm reported means milliseconds - not significant for any answer but my own.

Daniel Vérité's answer:

SELECT * FROM
  (SELECT seq FROM rand TABLESAMPLE BERNOULLI(1)) AS s
 ORDER BY RANDOM() LIMIT 1;

Ran 5 times - all times were over a minute, typically 01:00.mmm (one at 01:05.mmm).

Typical run:

test=# SELECT * FROM
  (SELECT seq FROM rand TABLESAMPLE BERNOULLI(1)) AS s
 ORDER BY RANDOM() LIMIT 1;
   seq   
---------
 9529212
(1 row)

Time: 60789.988 ms (01:00.790)

Swav's answer:

SELECT md5 FROM rand OFFSET (
    SELECT floor(random() * (SELECT count(seq) from rand))::int
) limit 1;

Ran 5 times - all times were over a minute, ranging from 01:03 to 01:29.

Typical run:

test=# SELECT md5 FROM rand OFFSET (
    SELECT floor(random() * (SELECT count(seq) from rand))::int
) limit 1;
               md5                
----------------------------------
 8004dfdfbaa9ac94243c33e9753e1f77
(1 row)

Time: 68558.096 ms (01:08.558)

Colin 't Hart's answer:

select * from rand where seq >= (
  select random()*(max(seq)-min(seq)) + min(seq) from rand
)
order by seq
limit 1;

Ran 5 times - times varied between 00:06.mmm and 00:14.mmm (Best of the Rest!)

Typical run:

test=# select * from rand where seq >= (
  select random()*(max(seq)-min(seq)) + min(seq) from rand
)
order by seq
limit 1;
   seq    |               md5                
----------+----------------------------------
 29277339 | 2b27c594f65659c832f8a609c8cf8e78
(1 row)

Time: 6944.771 ms (00:06.945)

Colin 't Hart's 2nd answer (adapted by me):

WITH min_max AS MATERIALIZED -- or NOT MATERIALIZED; it doesn't appear to make a difference
(
  SELECT MIN(seq) AS min_s, MAX(seq) AS max_s, MAX(seq) - MIN(seq) + 1 AS diff_s
  FROM rand
),
other AS MATERIALIZED
(
  SELECT (SELECT min_s FROM min_max)
       + FLOOR(RANDOM() * (SELECT diff_s FROM min_max))::INT AS seq_val
)
SELECT seq, md5 
FROM rand
WHERE seq = (SELECT seq_val FROM other);

Response time is between ~ 30 - 45ms with the odd outlier on either side of those times - it can even drop to 1.xxx ms from time to time. All I can really say is that it appears to be more consistent than either of the SYSTEM_TIME and SYSTEM_ROWS methods.

There is a major problem with this method, however. If the underlying field one is choosing for randomness is sparse, then this method won't return a value every time - the generated key may fall into a gap - and this may or may not be acceptable to the OP. You can do something like this instead (end of query):

SELECT seq, md5 
FROM rand
WHERE seq >= (SELECT seq_val FROM other)
LIMIT 1;

(note the >= and LIMIT 1). This can be very efficient (1.xxx ms), but seems to vary more than the seq = ... formulation - although once the cache appears to be warmed up, it regularly gives response times of ~ 1.5 ms. Note, too, that >= ... LIMIT 1 introduces a slight bias of its own: rows that immediately follow a gap in seq are more likely to be chosen.

Another advantage of this solution is that it doesn't require any special extensions which, depending on the context (consultants not being allowed to install "special" tools, DBA rules...), may not be available.

One really WEIRD thing about the above solution is that if the ::INT CAST is removed, the query takes ~ 1 minute. The reason is that FLOOR() returns DOUBLE PRECISION when its argument is DOUBLE PRECISION (which is what RANDOM() produces) - not an INTEGER, as one might expect - so without the cast the comparison is performed in double precision and the PRIMARY KEY index can't be used. I only discovered that this was an issue by running EXPLAIN (ANALYZE, BUFFERS).
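The return type is easy to check for yourself (a one-liner; the multiplier is arbitrary):

SELECT pg_typeof(FLOOR(RANDOM() * 1000000));
-- double precision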

With ::INT

   CTE other
     ->  Result  (cost=0.02..0.04 rows=1 width=4) (actual time=38.906..38.907 rows=1 loops=1)
           Buffers: shared hit=1 read=9
           InitPlan 4 (returns $3)
             ->  CTE Scan on min_max  (cost=0.00..0.02 rows=1 width=4) (actual time=38.900..38.902 rows=1 loops=1)
                   Buffers: shared hit=1 read=9
   InitPlan 6 (returns $5)
     ->  CTE Scan on other  (cost=0.00..0.02 rows=1 width=4) (actual time=38.909..38.910 rows=1 loops=1)
           Buffers: shared hit=1 read=9
 Planning Time: 0.329 ms
 Execution Time: 68.449 ms
(31 rows)

Time: 99.708 ms
test=#

Without ::INT

   CTE other
     ->  Result  (cost=0.02..0.04 rows=1 width=8) (actual time=0.082..0.082 rows=1 loops=1)
           Buffers: shared hit=10
           InitPlan 4 (returns $3)
             ->  CTE Scan on min_max  (cost=0.00..0.02 rows=1 width=4) (actual time=0.076..0.077 rows=1 loops=1)
                   Buffers: shared hit=10
   InitPlan 6 (returns $5)
     ->  CTE Scan on other  (cost=0.00..0.02 rows=1 width=8) (actual time=0.085..0.085 rows=1 loops=1)
           Buffers: shared hit=10
   ->  Parallel Seq Scan on rand  (cost=0.00..1458334.00 rows=208333 width=37) (actual time=52644.672..60025.906 rows=0 loops=3)
         Filter: ((seq)::double precision = $5)
         Rows Removed by Filter: 33333333
         Buffers: shared hit=14469 read=818865
 Planning Time: 0.378 ms
 Execution Time: 60259.401 ms
(37 rows)

Time: 60289.827 ms (01:00.290)
test=#

Note, in the plan without ::INT, the

   ->  Parallel Seq Scan on rand  (cost=0.00..1458334.00 rows=208333 width=37) (actual time=52644.672..60025.906 rows=0 loops=3)
         Filter: ((seq)::double precision = $5)

- a Parallel Seq Scan (with a high cost), filtering on (seq)::double precision. This is also why the filter is in double: seq is promoted to DOUBLE PRECISION to match the uncast FLOOR() result, so the PRIMARY KEY index on seq cannot be used and the whole table is scanned.

And

Buffers: shared hit=14469 read=818865

compared to (with ::INT)

Buffers: shared hit=1 read=9

Finally, my own answer again (same machine, time & cache):

(this is now redundant in the light of the benchmarking performed above).

Ran my own benchmark again 15 times - typically, times were sub-millisecond, with the occasional (approx. 1 in 3 or 4) run taking approx. 25 milliseconds.

Typical run:

test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(1);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 401578.81 | 30ff7ecfedea088dab75932f0b1ea872
(1 row)

Time: 0.708 ms

So, it would appear that my solution's worst times are ~ 200 times faster than the fastest of the other answers (Colin 't Hart's).

My analysis is that there is no perfect solution, but the best one appears to be the adaptation of Colin 't Hart's solution.

Finally, a GRAPHIC demonstration of the problem associated with using this solution for more than one record is shown below - taking a sample of 25 records (performed several times - typical run shown).

The tsm_system_rows method will produce 25 sequential records. This may be suitable for certain purposes where the fact that the random sample is a number of sequential records isn't a problem, but it's definitely worth keeping in mind.

test=# SELECT LENGTH((seq/100)::TEXT), seq/100::FLOAT, md5 FROM rand TABLESAMPLE SYSTEM_ROWS(25);
 length | ?column?  |               md5                
--------+-----------+----------------------------------
      6 | 763140.01 | 7e84b36ab30d3d2038ebd832c241b54d
      6 | 763140.02 | a976e258f6047df18e8ed0559ff48c36
--
--    SEQUENTIAL values of seq!
--
      6 | 763140.23 | ad4f1b4362187d6a300aaa9aaef30170
      6 | 763140.24 | 0c7fcc4f07d27fbcece68b2503f28d88
      6 | 763140.25 | 64d4507b18b5481a724d8a5bb6ac59c8
(25 rows)

Time: 29.348 ms

A similar state of affairs pertains with the SYSTEM_TIME method. As mentioned above, even with the minimum quantum of 1 μs, it returns 120 records; just as with SYSTEM_ROWS, these have sequential values of the PRIMARY KEY.

test=# SELECT seq, md5 FROM rand TABLESAMPLE SYSTEM_TIME(0.001);

returns:

   seq    |               md5                
----------+----------------------------------
 42392881 | e92f15cba600f0c7aa16db98c0183828
 42392882 | 93db51ea870e15202144d11810c8f40c
 42392883 | 7357bf0cf1fa23ab726e642832bb87b0
 42392884 | 1f5ce45fb17c8ba19b391f9b9c835242
 42392885 | f9922b502d4fd9ee84a904ac44d4e560
 ...
 ...  115 sequential values snipped for brevity!

Our sister site, StackOverflow, treated this very issue here. Good answers are provided by (yet again) Erwin Brandstetter here and by Evan Carroll here. That whole thread is worth reading in detail, since there are different definitions of random (monotonically increasing/decreasing, pseudorandom number generators...) and of sampling (with or without replacement...).
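For the multiple-records case, here is a minimal sketch of the index-friendly technique discussed in that thread. It is a sketch only: it assumes a dense integer key running from 1 to 100,000,000 (as in the rand table above), and the oversampling factor of 40 is illustrative, not canonical:

SELECT r.seq, r.md5
FROM (
    -- Generate more candidate keys than needed; DISTINCT removes duplicates,
    -- and the surplus covers any candidates that would fall into key gaps.
    SELECT DISTINCT 1 + FLOOR(RANDOM() * 100000000)::INT AS seq
    FROM generate_series(1, 40)
) AS candidates
JOIN rand r USING (seq)
LIMIT 25;

Each candidate key is resolved through the PRIMARY KEY index, so this avoids both the full sort of ORDER BY RANDOM() and the block-clustering of the TABLESAMPLE methods.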