Mysql – Group only certain rows with GROUP BY

greatest-n-per-groupgroup byMySQL

Schema

I have the following set-up in MySQL database:

CREATE TABLE items (
  id SERIAL,
  name VARCHAR(100),
  group_id INT,
  price DECIMAL(10,2),
  KEY items_group_id_idx (group_id),
  PRIMARY KEY (id)
);

INSERT INTO items VALUES 
(1, 'Item A', NULL, 10),
(2, 'Item B', NULL, 20),
(3, 'Item C', NULL, 30),
(4, 'Item D', 1,    40),
(5, 'Item E', 2,    50),
(6, 'Item F', 2,    60),
(7, 'Item G', 2,    70);

Problem

I need to select:

  • All items with group_id that has NULL value, and
  • One item from each group identified by group_id having the lowest price.

Expected results

+----+--------+----------+-------+
| id | name   | group_id | price |
+----+--------+----------+-------+
|  1 | Item A |     NULL | 10.00 | 
|  2 | Item B |     NULL | 20.00 | 
|  3 | Item C |     NULL | 30.00 | 
|  4 | Item D |        1 | 40.00 | 
|  5 | Item E |        2 | 50.00 | 
+----+--------+----------+-------+

Possible solution 1

Two queries with UNION ALL:

SELECT id, name, group_id, price FROM items
WHERE group_id IS NULL
UNION ALL
SELECT id, name, MIN(price) FROM items
WHERE group_id IS NOT NULL
GROUP BY group_id;
/* EXPLAIN */
+----+--------------+------------+------+--------------------+--------------------+---------+-------+------+----------------------------------------------+
| id | select_type  | table      | type | possible_keys      | key                | key_len | ref   | rows | Extra                                        |
+----+--------------+------------+------+--------------------+--------------------+---------+-------+------+----------------------------------------------+
|  1 | PRIMARY      | items      | ref  | items_group_id_idx | items_group_id_idx | 5       | const |    3 | Using where                                  | 
|  2 | UNION        | items      | ALL  | items_group_id_idx | NULL               | NULL    | NULL  |    7 | Using where; Using temporary; Using filesort | 
| NULL | UNION RESULT | <union1,2> | ALL  | NULL               | NULL               | NULL    | NULL  | NULL |                                              | 
+----+--------------+------------+------+--------------------+--------------------+---------+-------+------+----------------------------------------------+

However it is undesirable to have two queries since there will be more complex condition in WHERE clause and I would need to sort the final results.

Possible solution 2

GROUP BY on expression (reference)

SELECT id, name, group_id, MIN(price) FROM items
GROUP BY CASE WHEN group_id IS NOT NULL THEN group_id ELSE RAND() END;
/* EXPLAIN */
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | items | ALL  | NULL          | NULL | NULL    | NULL |    7 | Using temporary; Using filesort | 
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+

Solution 2 seems to be faster and simple to use but I'm wondering whether there is a better approach in terms of performance.

Update: According to MySQL documentation, this query is illegal in SQL92 and earlier and may work in MySQL only.

Best Answer

First, do not use either of your two queries. Both have a group by some column (GROUP BY group_id) and then select other columns, non-aggregated (SELECT id, name). This may give you wrong and unexpected results, despite that it may work in your tests, with some small sized table.

Second, the UNION ALL is not a problem. If the two subqueries perform efficiently, then the final union is ok, too. If you need a sort, the efficiency will depend on how that sort differs from the indexes used.

Now, the problem of "groupwise-max" or "greatest-n-per-group" has many solutions (and even a tag, both at SO and here). There are two sub-problems, depending on whether ties can happen and what the wanted results are in those cases.

If you want all the tied rows, the solution with GROUP BY inside a derived table is usually good. In your case, that you want just one row returned per group, another approach is easier to write and usually performs very well when there is a small number of group overall:

SELECT id, name, price 
FROM items
WHERE group_id IS NULL

UNION ALL

SELECT i.id, i.name, i.price 
FROM 
    ( SELECT DISTINCT group_id 
      FROM items
      WHERE group_id IS NOT NULL
    ) AS di
  JOIN 
    items AS i
  ON  i.id = 
    ( SELECT id
      FROM items 
      WHERE group_id = di.group_id
      ORDER BY price, id             -- order for resolving ties
      LIMIT 1
    ) 
ORDER BY
    <some_columns> ;                 -- final order

An index on (group_id, price, id) will be helpful