I've spent several hours trying to write a "simple" SELECT with GROUP BY in Postgres, without success. The GROUP BY clause is giving me problems.
I have the table cities with columns user_id and city. user_id can be repeated, so the table can contain rows like these:
"Bill", "New York"
"Bill", "Chicago"
"Adam", "New York"
"Mike", "Los Angeles"
...
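To reproduce the example, a minimal setup could look like the following. The column types are assumptions, since the question does not state them, and only the four sample rows shown above are inserted.

```sql
-- Assumed schema: both columns as text, matching the sample values above.
CREATE TABLE cities (
   user_id text NOT NULL,
   city    text NOT NULL
);

-- The sample rows from the question.
INSERT INTO cities (user_id, city) VALUES
   ('Bill', 'New York'),
   ('Bill', 'Chicago'),
   ('Adam', 'New York'),
   ('Mike', 'Los Angeles');
```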
If I want the count of cities, it can be done this way:
SELECT cities.city, COUNT(*) FROM cities GROUP BY cities.city
But if I want to have this count and take only one city per user (it doesn't matter if "Bill" takes "New York" or "Chicago") how could I group by cities.user_id?
Best Answer
Your query does not exactly do a count of cities, but rather the count of users per listed city. To get that after de-duplicating users:
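A sketch of such a query, assuming the cities table from the question (the subquery alias sub and the output column name ct are only illustrative):

```sql
SELECT city, count(*) AS ct
FROM  (
   -- Keep one row per user_id; without ORDER BY, Postgres picks it arbitrarily.
   SELECT DISTINCT ON (user_id) city
   FROM   cities
   ) sub
GROUP  BY city;
```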
This picks one row per user_id arbitrarily, like you specified. So we need no ORDER BY in the inner SELECT.
We need nothing but the city from the inner query for the bare count.
Detailed explanation for DISTINCT ON:
Not deterministic for arbitrary pick
The above is typically fastest for few rows per user_id, while implementing the stated requirements.
But the result is not deterministic as long as we pick rows arbitrarily. It can return different numbers for repeated executions, as Postgres is free to pick any row for each user. (The total sum over all cities is stable, though, being the count of users.)
The result is normally stable, but any change to the table can trigger a different result, like autovacuum doing its job in the background, or any unrelated write operation on the table.
To get deterministic results, you need to add a deterministic ORDER BY to the inner query, so that DISTINCT ON always picks the same row. Like:
Which is equivalent to:
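Both forms can be sketched as follows, assuming the cities table from the question. The first adds ORDER BY user_id, city to the subquery, so the alphabetically first city per user is always picked. The second gets the same result with min(city) and GROUP BY user_id. Using city as the tiebreaker is an assumption here; any deterministic sort order would serve.

```sql
-- Variant 1: DISTINCT ON with a deterministic ORDER BY.
-- The leading ORDER BY expression must match the DISTINCT ON expression.
SELECT city, count(*) AS ct
FROM  (
   SELECT DISTINCT ON (user_id) city
   FROM   cities
   ORDER  BY user_id, city
   ) sub
GROUP  BY city;

-- Variant 2: the same pick expressed with a plain aggregate.
SELECT city, count(*) AS ct
FROM  (
   SELECT min(city) AS city
   FROM   cities
   GROUP  BY user_id
   ) sub
GROUP  BY city;
```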