The problem is that your 2nd query:
SELECT datasets.id
FROM datasets
GROUP BY datasets.id
ORDER BY date_added
LIMIT 25 ;
does not mean what you expect. It gives you the first 25 rows ordered by date_added only because id is the primary key of the table, so the GROUP BY can be removed without changing the result.
It seems, however, that the optimizer does not always remove the redundant GROUP BY, and thus it produces a different plan. I'm not sure why; the various optimizer features that perform these simplifications are far from covering all cases.
You might get a better plan if you change the query to have matching GROUP BY and ORDER BY clauses:
SELECT d.id
FROM datasets AS d
GROUP BY d.date_added, d.id
ORDER BY d.date_added, d.id
LIMIT 25 ;
But in any case, my advice would be: don't use redundant or complicated syntax when a simpler one exists.
Now for the 3rd query, with the join: while the GROUP BY method works, you can rewrite it using standard SQL window functions (ROW_NUMBER()), Postgres's DISTINCT ON, or by joining to a derived table (which uses your very first query, with minor details changed):
SELECT
d.id,
array_remove(array_agg(o.some_column), NULL) AS other_table
FROM
( SELECT d.id, d.date_added
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25
) AS d
LEFT JOIN other_table AS o
ON o.id = d.id
GROUP BY d.date_added, d.id
ORDER BY d.date_added
LIMIT 25 ;
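For reference, the ROW_NUMBER() variant mentioned above could be sketched like this (assuming the same hypothetical datasets/other_table schema; note that it aggregates the whole join before limiting, so it may not plan as well as the derived-table version):

```sql
SELECT id, other_table
FROM
  ( SELECT
        d.id,
        array_remove(array_agg(o.some_column), NULL) AS other_table,
        -- window functions run after GROUP BY, so this numbers the grouped rows
        ROW_NUMBER() OVER (ORDER BY d.date_added) AS rn
    FROM datasets AS d
    LEFT JOIN other_table AS o
        ON o.id = d.id
    GROUP BY d.date_added, d.id
  ) AS sub
WHERE rn <= 25
ORDER BY rn ;
```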
We could also avoid GROUP BY completely (well, the aggregation is hidden in the correlated scalar subquery):
SELECT
d.id,
( SELECT array_remove(array_agg(o.some_column), NULL)
FROM other_table AS o
WHERE o.id = d.id
) AS other_table
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25 ;
Both queries are written so that the produced plan performs the (fast) LIMIT subquery first and then the join, avoiding a full scan of either table.
If you need aggregates from more columns, a third method combines both of the above, using a correlated (LATERAL) subquery in the FROM clause:
SELECT
d.id,
o.other_table
-- more aggregates
FROM
( SELECT d.id, d.date_added
FROM datasets AS d
ORDER BY d.date_added
LIMIT 25
) AS d
LEFT JOIN LATERAL
( SELECT array_remove(array_agg(o.some_column), NULL) AS other_table
-- more aggregates
FROM other_table AS o
WHERE o.id = d.id
) AS o
ON TRUE
ORDER BY d.date_added
LIMIT 25 ;
I found it in the SQL 2011 spec...
If the <select list> "*" is simply contained in a <table subquery> that is immediately contained in an <exists predicate>, then the <select list> is equivalent to a <value expression> that is an arbitrary <literal>.
This confirms that, because * is not treated as equivalent to an arbitrary literal in this context, it is in fact PostgreSQL that is breaking the spec.
Keep in mind this is a distinct problem from
SELECT *
FROM ( VALUES (1),(2),(3) ) AS t(x)
HAVING count(*) > 1
which both databases reject.
PostgreSQL:
ERROR: column "t.x" must appear in the GROUP BY clause or be used in an aggregate function
SQL Server:
Column 't.x' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Why this bug persists in PostgreSQL
Thanks go to RhodiumToad on irc.freenode.net/#PostgreSQL for his help troubleshooting this. He also points out the difficulty in resolving this situation:
20:33 < RhodiumToad> the one problem is that in pg you can do exists(select func() from ... where the func() is an SRF that might return 0 rows
An SRF is a set-returning function. In PostgreSQL, we can, for instance, use an SRF to generate a series from 1 to 10 (generate_series is in core):
SELECT * FROM generate_series(1,10);
Likewise, we can put it directly in the select list:
SELECT generate_series(1,10);
Two of them together give us a cross join (Cartesian product):
SELECT generate_series(1,10), generate_series(1,2);
But if either of them returns zero rows, you get nothing. Effectively the same as this:
SELECT * FROM ( VALUES (1) ) AS t(x)
CROSS JOIN ( SELECT 1 LIMIT 0 ) AS g;
And that's the problem with optimizing this out entirely: an SRF in the select list inside an EXISTS predicate can return zero rows, forcing the EXISTS to evaluate to false.
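A minimal illustration of that behavior (a sketch; generate_series(1, 0) is just a convenient SRF that yields zero rows):

```sql
-- The subquery's FROM clause produces a row, but the SRF in the
-- select list yields zero rows, so the subquery returns no rows
-- and EXISTS comes out false:
SELECT EXISTS (
    SELECT generate_series(1, 0)
    FROM ( VALUES (1) ) AS t(x)
);
```

If * were always replaceable by an arbitrary literal here, the optimizer would wrongly report true for such queries.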
Best Answer
I figured it out. It's the damn parentheses around the columns in the GROUP BY clause. The correct query is: