PostgreSQL – GROUP BY, but use only one row per user

Tags: distinct, greatest-n-per-group, group by, postgresql

I've spent several hours trying to write a "simple" SELECT with GROUP BY in Postgres, without success. The GROUP BY clause is giving me problems.

I have the table cities with columns user_id and city.
user_id can be repeated, so the table can have this information:

"Bill", "New York"
"Bill", "Chicago"
"Adam", "New York"
"Mike", "Los Angeles"
...

If I want the count per city, it can be done this way:

SELECT cities.city, COUNT(*) FROM cities GROUP BY cities.city

But if I want to have this count and take only one city per user (it doesn't matter if "Bill" takes "New York" or "Chicago") how could I group by cities.user_id?

Best Answer

Your query does not exactly do a count of cities, but rather the count of users per listed city. To get that after de-duplicating users:

SELECT city, count(*) AS users
FROM  (
   SELECT DISTINCT ON (user_id) city
   FROM   cities
   ) sub
GROUP  BY city;

This picks one row per user_id arbitrarily like you specified. So we need no ORDER BY in the inner SELECT.

We need nothing but the city from the inner query for the bare count.
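
To try the query without touching a real table, here is a minimal sketch that feeds it the sample rows from the question. The CTE named cities is an assumption for illustration only: it stands in for the real table and types both columns as text.

```sql
-- Stand-in for the real table, built from the question's sample rows (illustrative).
WITH cities (user_id, city) AS (
   VALUES ('Bill', 'New York')
        , ('Bill', 'Chicago')
        , ('Adam', 'New York')
        , ('Mike', 'Los Angeles')
   )
SELECT city, count(*) AS users
FROM  (
   SELECT DISTINCT ON (user_id) city  -- one arbitrary row per user
   FROM   cities
   ) sub
GROUP  BY city;
```

Bill is counted for either New York or Chicago, never both. So the result is either New York with 2 users and Los Angeles with 1, or one user each for Chicago, Los Angeles and New York.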

Detailed explanation for DISTINCT ON:

Not deterministic for arbitrary pick

The above is typically fastest for few rows per user_id, while implementing stated requirements.

But the result is not deterministic as long as we pick rows arbitrarily. Repeated executions can return different numbers, because Postgres is free to pick any row for a given user. (The sum over all cities is stable, though: it is always the count of users.)

The result is normally stable, but any change to the table can trigger a different result: autovacuum doing its job in the background, for instance, or any unrelated write operation on the table.
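
The stable total can be checked with a sketch like the following. It again uses a CTE named cities holding the question's sample rows as a stand-in for the real table (an assumption for illustration), and it assumes user_id is never NULL, since count(DISTINCT ...) ignores NULL while DISTINCT ON keeps one row for it.

```sql
-- Stand-in for the real table, built from the question's sample rows (illustrative).
WITH cities (user_id, city) AS (
   VALUES ('Bill', 'New York')
        , ('Bill', 'Chicago')
        , ('Adam', 'New York')
        , ('Mike', 'Los Angeles')
   )
, per_city AS (
   SELECT city, count(*) AS users
   FROM  (
      SELECT DISTINCT ON (user_id) city  -- arbitrary pick per user
      FROM   cities
      ) sub
   GROUP  BY city
   )
SELECT (SELECT sum(users) FROM per_city)              AS total_users
     , (SELECT count(DISTINCT user_id) FROM cities)  AS distinct_users;
```

Whichever city gets picked for Bill, both columns return 3 for this sample: only the distribution across cities can vary, not the total.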

To get deterministic results, add a deterministic ORDER BY to the inner query, so that DISTINCT ON always picks the same row. For example:

SELECT city, count(*) AS users
FROM  (
   SELECT DISTINCT ON (user_id) city
   FROM   cities
   ORDER  BY user_id, city  -- making the pick deterministic
   ) sub
GROUP  BY 1;

This is equivalent to:

SELECT city, count(*) AS users
FROM  (
   SELECT min(city) AS city  -- alias required: the outer SELECT references "city"
   FROM   cities
   GROUP  BY user_id
   ) sub
GROUP  BY 1;
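
To see the two deterministic variants agree, the following sketch runs both against the question's sample rows and lines the counts up side by side. The CTE named cities and the names via_distinct_on, via_min, users_distinct_on and users_min are assumptions introduced for illustration.

```sql
-- Stand-in for the real table, built from the question's sample rows (illustrative).
WITH cities (user_id, city) AS (
   VALUES ('Bill', 'New York')
        , ('Bill', 'Chicago')
        , ('Adam', 'New York')
        , ('Mike', 'Los Angeles')
   )
, via_distinct_on AS (
   SELECT city, count(*) AS users
   FROM  (
      SELECT DISTINCT ON (user_id) city
      FROM   cities
      ORDER  BY user_id, city   -- first city in sort order per user
      ) sub
   GROUP  BY 1
   )
, via_min AS (
   SELECT city, count(*) AS users
   FROM  (
      SELECT min(city) AS city  -- smallest city per user
      FROM   cities
      GROUP  BY user_id
      ) sub
   GROUP  BY 1
   )
SELECT city, d.users AS users_distinct_on, m.users AS users_min
FROM   via_distinct_on d
FULL   JOIN via_min m USING (city)
ORDER  BY city;
```

With this sample, Bill is always assigned Chicago because it sorts before New York, so both columns show 1 for each of Chicago, Los Angeles and New York on every run.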