PostgreSQL – Random Function Returns Duplicate Results When Ordered by Random

postgresql random sorting

I'm using PostgreSQL 11.1-1 on Windows 10. If I run a query like the following, I get a different number in each random column:

postgres=# SELECT random() as sortOrder, random() as col2;
     sortorder     |       col2
-------------------+-------------------
 0.607938482426107 | 0.121234225574881
(1 row)

But if I add an ORDER BY clause, as shown below, random() returns the same number in both columns of each row:

postgres=# SELECT random() as sortOrder, random() as col2 
               FROM generate_series(0,1) 
               ORDER BY sortOrder;
     sortorder     |       col2
-------------------+-------------------
 0.100375576410443 | 0.100375576410443
 0.170669795479625 | 0.170669795479625
(2 rows)

I'm assuming there is a technical reason for this, but is it possible in this last query to get a different random value for each column?

Edit:
I may have oversimplified in my original question. The actual query I was attempting was more like the following:

SELECT column1, random(), column2, random()
               FROM table1
               ORDER BY random();

So, having an answer that can be combined with a table query is also important.

Best Answer

random() is volatile, and the doc says about volatility:

A query using a volatile function will re-evaluate the function at every row where its value is needed.

Based on this definition, the engine is correct in evaluating the function only once per row, not once per column for the same row.

To turn off this optimization, you might tweak the query so that the optimizer misses the fact that the two columns carry the same expression. For instance:

postgres=> SELECT random()+0 as sortOrder, random() as col2 
               FROM generate_series(0,1) 
               ORDER BY sortOrder;
     sortorder      |        col2         
--------------------+---------------------
 0.0412265718914568 | 0.00992022315040231
  0.841067179106176 |   0.818967831321061
(2 rows)

Now, that simple solution may seem a bit fragile. In some future version, the optimizer might theoretically become smart enough to recognize that these two expressions have a common sub-expression that it can reuse.
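Applied to the table query from the question's edit, the same trick might look like the sketch below. It is untested and assumes table1, column1 and column2 exactly as named in the question; the aliases rand1 and rand2 are made up for illustration. The idea is only that each random() call sits in a textually different expression, so it shares the fragility described above.

```sql
-- Sketch only: table1/column1/column2 come from the question,
-- rand1 and rand2 are illustrative aliases.
SELECT column1,
       random()     AS rand1,
       column2,
       random() + 0 AS rand2   -- differs from the plain random() above
FROM table1
ORDER BY random() + 1;         -- differs from both select-list expressions
```

Because the sort key is shifted by a constant, the resulting row order is still a random shuffle.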

A really robust solution would be to generate the columns in different rows to start with, in order to force the optimizer to use distinct function calls no matter how hard it tries to avoid the multiple evaluations. For instance:

SELECT v1,v2 FROM
(select random() as v1, row_number() over() as r1
 FROM generate_series(0,1)) as s1
JOIN
(select random() as v2, row_number() over() as r2
 FROM generate_series(0,1)) as s2
ON s1.r1=s2.r2
ORDER BY 1;
        v1         |        v2         
-------------------+-------------------
 0.145532250870019 | 0.963712519966066
 0.277611976023763 | 0.252654927782714
(2 rows)
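The same join idea might be adapted to the table query from the question's edit. This is an untested sketch that assumes table1, column1 and column2 as named in the question; the aliases rand1, rand2 and rn are made up for illustration. It scans table1 twice, so it trades speed for independent random values.

```sql
-- Sketch only: table1/column1/column2 come from the question;
-- rand1, rand2 and rn are illustrative names.
SELECT t.column1, t.rand1, t.column2, r.rand2
FROM (SELECT column1, column2,
             random()             AS rand1,
             row_number() OVER () AS rn   -- arbitrary row numbering
      FROM table1) AS t
JOIN (SELECT random()             AS rand2,
             row_number() OVER () AS rn   -- one random value per row of table1
      FROM table1) AS r
  ON t.rn = r.rn
ORDER BY random();  -- not in the outer select list, so it should be a call of its own
```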

Related Solutions

Postgresql – For each record, count records dated earlier in PostgreSQL

You can use the RANK() function for this:

SELECT 
    col1, col2, ..., year,
    RANK() OVER (ORDER BY year) - 1
      AS number_of_entries_prior_to_this_year
FROM
    tableX ;

Tested in SQLFiddle.

There is also a more complicated approach:

SELECT 
    col1, col2, year,
    COUNT(*) OVER (ORDER BY year) - COUNT(*) OVER (PARTITION BY year)
      AS number_of_entries_prior_to_this_year
FROM
    tableX ;
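To see that both queries agree, here is a small self-contained check that uses an inline VALUES list in place of tableX (the sample years are made up). For the years 2001, 2001, 2003, 2004, 2004, RANK() yields 1, 1, 3, 4, 4, so subtracting 1 gives 0, 0, 2, 3, 3. The COUNT(*) difference gives 2-2, 2-2, 3-1, 5-2, 5-2, which is the same.

```sql
-- Illustrative data only; via_rank and via_count are made-up aliases.
SELECT year,
       RANK() OVER (ORDER BY year) - 1         AS via_rank,
       COUNT(*) OVER (ORDER BY year)
         - COUNT(*) OVER (PARTITION BY year)   AS via_count
FROM (VALUES (2001), (2001), (2003), (2004), (2004)) AS tableX(year)
ORDER BY year;
-- year | via_rank | via_count
-- 2001 |        0 |         0
-- 2001 |        0 |         0
-- 2003 |        2 |         2
-- 2004 |        3 |         3
-- 2004 |        3 |         3
```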

PostgreSQL – Sampling Techniques for Random Data

In Postgres 9.3+, you can use the following:

select t.*
from 
  generate_series(1, 10) as x(i)
    cross join lateral
  ( select *, x.i
    from Table_Name
    -- where random() < 0.01
    order by random()
    limit 1 
  ) as t ;

which basically chooses 1 random row, 10 times.

In older versions, you can use a simple cross join (no lateral):

select t.*
from 
    generate_series(1, 1000) as x(i)
  cross join 
    Table_Name as t
    -- where random() < 0.01
    order by random()
    limit 10 ;

which creates a 1000-fold copy of the table (so each row is there 1000 times) and then chooses 10 rows with the same method as your query. If the number of copies (1000) is large enough compared to the wanted rows (10), the probabilities are almost equal to the probabilities you would have got with replacement.

Performance of this second query will of course be horrible, even with small tables.
