In the following example I have a table foo
from which I'd like to pick out at random a row per group.
CREATE TABLE foo (
line INT
);
INSERT INTO foo (line)
SELECT generate_series(0, 999, 1);
Let's say that I'd like to group by line % 10
. I could do this with:
SELECT DISTINCT ON (bin) bin, line
FROM (
SELECT line, line % 10 AS bin, random() x
FROM foo
ORDER BY x
) X
What I'd like to do is get random picks from each bin several times. I had thought I'd be able to do this with generate_series()
and LATERAL
SELECT i, line, bin
FROM
(
SELECT generate_series(1,3) i
) m,
LATERAL
(SELECT DISTINCT ON (bin) bin, line
FROM (
SELECT line, line % 10 bin, random() x
FROM foo
ORDER BY x
) X
ORDER BY bin) Q
ORDER BY bin, i;
However, when I do this in PostgreSQL 9.5 I find I get the same line
for a given bin
for each iteration i
, e.g.,
i;line;bin
1;530;0
2;530;0
3;530;0
1;611;1
2;611;1
3;611;1
...
I'm confused, as I thought the subquery containing the random()
would be run differently for each line from the generate_series()
.
EDIT: I realised that I can achieve the same objective by generating more combinations and choosing from these with
SELECT DISTINCT ON (bin, round) round, bin, line
FROM (
SELECT line, line % 10 as bin, round
FROM foo, generate_series(1,3) round
ORDER BY bin, random()
) X;
So my question is simply why didn't the first way work?
EDIT: The problem appears to be that LATERAL only acts like a for-loop if the subqueries are correlated in some way (thanks to @ypercube's comment). Hence my original approach can be fixed by adding the following small change
SELECT i, line, bin
FROM
(
SELECT generate_series(1,3) i
) m,
LATERAL
(
SELECT DISTINCT ON (bin) bin, line
FROM (
SELECT line, line % 10 bin, m.i, random() x -- <NOTE m.i HERE
FROM foo
ORDER BY x
) X
ORDER BY bin
LIMIT 3
) Q
ORDER BY bin, i;
Best Answer
I'd write the query like this, using
LIMIT (3)
instead ofDISTINCT ON
.The
generate_series(0, 9)
is used to get all the distinct bins. You could use(SELECT DISTINCT line % 10 FROM foo) AS g (bin)
instead, if the "bins" are not all the integers from 0 up to 9:Also, if you don't need the
random()
number in the output, you could useORDER BY random()
in the subquery and removex
from the select and order by clauses - or replaceORDER BY d.x
withORDER BY d.line
.