Postgresql – GROUP BY, but use only one row per user

distinct, greatest-n-per-group, group by, postgresql

I've spent several hours trying to write a "simple" SELECT with GROUP BY in Postgres, without success. The GROUP BY clause is giving me problems.

I have the table cities with columns user_id and city.
A user_id can appear more than once, so the table can contain rows like these:

"Bill", "New York"
"Bill", "Chicago"
"Adam", "New York"
"Mike", "Los Angeles"
...

If I wanted the count per city, I could write it this way:

SELECT cities.city, COUNT(*) FROM cities GROUP BY cities.city

But if I want this count while taking only one city per user (it doesn't matter whether "Bill" gets "New York" or "Chicago"), how can I group by cities.user_id?

Best Answer

Your query does not exactly do a count of cities, but rather the count of users per listed city. To get that after de-duplicating users:

SELECT city, count(*) AS users
FROM  (
   SELECT DISTINCT ON (user_id) city
   FROM   cities
   ) sub
GROUP  BY city;

This picks one row per user_id arbitrarily, as you specified, so we need no ORDER BY in the inner SELECT.

We need nothing but the city from the inner query for the bare count.

Detailed explanation for DISTINCT ON:

Select first row in each GROUP BY group?

Not deterministic for arbitrary pick

The above is typically the fastest option when there are few rows per user_id, and it implements the stated requirements.

But the result is not deterministic as long as we pick rows arbitrarily. Repeated executions can return different numbers, because Postgres is free to pick any row for each user. (The total sum over all cities is stable, though, being the count of users.)

The result is normally stable, but any change to the table can trigger a different result, such as autovacuum doing its job in the background or any unrelated write operation on the table.

To get deterministic results, add a deterministic ORDER BY to the inner query, so that DISTINCT ON always picks the same row. For example:

SELECT city, count(*) AS users
FROM  (
   SELECT DISTINCT ON (user_id) city
   FROM   cities
   ORDER  BY user_id, city  -- making the pick deterministic
   ) sub
GROUP  BY 1;

This is equivalent to:

SELECT city, count(*) AS users
FROM  (
   SELECT min(city) AS city
   FROM   cities
   GROUP  BY user_id
   ) sub
GROUP  BY 1;
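
To make the difference between the two counts concrete, here is a minimal sketch that runs the min(city) variant against the four sample rows from the question. It assumes an in-memory SQLite database (via Python's sqlite3 module) as a stand-in for Postgres, because SQLite has no DISTINCT ON; only the GROUP BY form is exercised, and the column types are assumptions.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres; columns are plain text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (user_id TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Bill", "New York"), ("Bill", "Chicago"),
     ("Adam", "New York"), ("Mike", "Los Angeles")],
)

# Query from the question: every row counts, so Bill contributes twice.
per_row = conn.execute(
    "SELECT city, count(*) FROM cities GROUP BY city ORDER BY city"
).fetchall()

# One city per user: the alphabetically first city, so the pick is deterministic.
per_user = conn.execute("""
    SELECT city, count(*) AS users
    FROM  (
       SELECT min(city) AS city
       FROM   cities
       GROUP  BY user_id
       ) sub
    GROUP  BY city
    ORDER  BY city
""").fetchall()

print(per_row)   # [('Chicago', 1), ('Los Angeles', 1), ('New York', 2)]
print(per_user)  # [('Chicago', 1), ('Los Angeles', 1), ('New York', 1)]
```

With the deterministic pick, Bill is always counted under "Chicago", and the counts add up to the number of distinct users (3).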

Related Solutions

Sql-server – How to select multiple columns but only group by one

In SQL Server you can only select columns that are part of the GROUP BY clause, or aggregate functions on any of the other columns. I've blogged about this in detail here. So you have two options:

Add the additional columns to the GROUP BY clause:

GROUP BY Rls.RoleName, Pro.[FirstName], Pro.[LastName]

Add some aggregate function on the relevant columns:

SELECT Rls.RoleName, MAX(Pro.[FirstName]), MAX(Pro.[LastName])

The second solution is mostly a workaround and an indication that you should fix something more general with your query.

Mysql – Group only certain rows with GROUP BY

First, do not use either of your two queries. Both group by one column (GROUP BY group_id) and then select other, non-aggregated columns (SELECT id, name). This may give you wrong and unexpected results, even though it may appear to work in your tests on a small table.

Second, the UNION ALL is not a problem. If the two subqueries perform efficiently, then the final union is ok, too. If you need a sort, the efficiency will depend on how that sort differs from the indexes used.

Now, the problem of "groupwise-max" or "greatest-n-per-group" has many solutions (and even a tag, both at SO and here). There are two sub-problems, depending on whether ties can happen and what the wanted results are in those cases.

If you want all the tied rows, the solution with GROUP BY inside a derived table is usually good. In your case, where you want just one row returned per group, another approach is easier to write and usually performs very well when there is a small number of groups overall:

SELECT id, name, price 
FROM items
WHERE group_id IS NULL

UNION ALL

SELECT i.id, i.name, i.price 
FROM 
    ( SELECT DISTINCT group_id 
      FROM items
      WHERE group_id IS NOT NULL
    ) AS di
  JOIN 
    items AS i
  ON  i.id = 
    ( SELECT id
      FROM items 
      WHERE group_id = di.group_id
      ORDER BY price, id             -- order for resolving ties
      LIMIT 1
    ) 
ORDER BY
    <some_columns> ;                 -- final order

An index on (group_id, price, id) will be helpful.
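
As a rough illustration of the query and index above, the sketch below runs them against a small made-up items table. Everything about the data is hypothetical: the column types, the rows, and the index name items_group_price_id are assumptions, the final order is arbitrarily set to id, and in-memory SQLite (via Python's sqlite3) is used in place of MySQL only so that the example is self-contained.

```python
import sqlite3

# Assumption: hypothetical schema and rows; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, group_id INTEGER, name TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [(1, None, "solo", 5),      # ungrouped row, always returned
     (2, 10, "a-pricey", 9),
     (3, 10, "a-cheap", 4),     # cheapest in group 10
     (4, 20, "b-tie-1", 7),     # tied on price; lower id wins
     (5, 20, "b-tie-2", 7)],
)
# The suggested index (the name is made up for this sketch).
conn.execute("CREATE INDEX items_group_price_id ON items (group_id, price, id)")

rows = conn.execute("""
    SELECT id, name, price
    FROM items
    WHERE group_id IS NULL

    UNION ALL

    SELECT i.id, i.name, i.price
    FROM
        ( SELECT DISTINCT group_id
          FROM items
          WHERE group_id IS NOT NULL
        ) AS di
      JOIN
        items AS i
      ON  i.id =
        ( SELECT id
          FROM items
          WHERE group_id = di.group_id
          ORDER BY price, id             -- order for resolving ties
          LIMIT 1
        )
    ORDER BY id                          -- final order (arbitrary choice here)
""").fetchall()

print(rows)  # [(1, 'solo', 5), (3, 'a-cheap', 4), (4, 'b-tie-1', 7)]
```

Each group contributes exactly one row, and the tie in group 20 is resolved by id because of the ORDER BY price, id in the correlated subquery.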
