GROUP BY using columns not in the SELECT list – when is this practical, elegant or powerful

TL;DR – I would like examples where a GROUP BY on columns that are not in the SELECT list (example shown below) resolves an SQL challenge in a [practical | elegant | powerful] way. I mean this in a general way: the example below is interesting because it demonstrates the principle, but it doesn't work! I want working examples where this technique achieves something "significant".

This follows up on this question, which required counting the routes from point_x to point_y and vice versa, treating routes between the same two points as equivalent – i.e. A -> B is the same for counting purposes as B -> A. A working general solution to that question is given there.

However, one poster (SQLRaptor) showed a solution that solved the problem for the data shown in the question (presumably a subset). This poster also said that the solution wasn't general and asked the OP whether they could see why. SQLRaptor added that this was an example of a GROUP BY using columns not in the SELECT list, and that this could be an elegant (powerful) solution to some SQL challenges!

SQLRaptor's solution (using the notation from my own answer) was:

SELECT  MIN(origin) AS point_1, 
        MAX(destination) AS point_2,
        COUNT(*) AS journey_count
FROM route
GROUP BY ASCII(origin) * ASCII(destination)
ORDER BY point_1, point_2

I thought the non-generality might be something to do with the multiplication (duplicates), but had no concrete proof. This transpired to be the case – see my proof here and here.
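
To make the duplicate problem concrete, here is a minimal sketch (not SQLRaptor's code, and not the linked proof). It assumes single-letter point names and runs the query through Python's sqlite3 module, with SQLite's unicode() standing in for ASCII(), since SQLite has no ASCII() function. The three-row route table and the function names are made up for illustration. Because 65 * 84 = 70 * 78 = 5460, the routes A -> T and F -> N fall into the same group.

```python
import sqlite3

# Hypothetical sample data: A<->T is one route pair, F->N is unrelated.
ROUTES = [("A", "T"), ("T", "A"), ("F", "N")]


def make_route_table():
    """Create an in-memory route table holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE route (origin TEXT, destination TEXT)")
    conn.executemany("INSERT INTO route VALUES (?, ?)", ROUTES)
    return conn


def product_grouping():
    """Group on the product of the code points, as in the query above."""
    conn = make_route_table()
    rows = conn.execute(
        """
        SELECT MIN(origin) AS point_1,
               MAX(destination) AS point_2,
               COUNT(*) AS journey_count
        FROM route
        GROUP BY unicode(origin) * unicode(destination)
        ORDER BY point_1, point_2
        """
    ).fetchall()
    conn.close()
    return rows


def normalized_grouping():
    """Group on the ordered pair instead, so products cannot collide."""
    conn = make_route_table()
    rows = conn.execute(
        """
        SELECT point_1, point_2, COUNT(*) AS journey_count
        FROM (SELECT min(origin, destination) AS point_1,
                     max(origin, destination) AS point_2
              FROM route)
        GROUP BY point_1, point_2
        ORDER BY point_1, point_2
        """
    ).fetchall()
    conn.close()
    return rows


print(product_grouping())     # [('A', 'T', 3)]  F -> N is swallowed
print(normalized_grouping())  # [('A', 'T', 2), ('F', 'N', 1)]
```

The second query is only one way to normalize the pair; it is shown to contrast with the colliding product, not as the answer given in the linked question.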

Unfortunately, this example of a GROUP BY on columns not in the SELECT list doesn't work very well. What I would like are examples where this technique does work well.

Best Answer

I have used this technique (grouping on a column not used in the SELECT list) to combine an ordered list of values (which has gaps) into ranges of contiguous values.

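First I apply ROW_NUMBER() to the ordered set of values. Then I subtract the row number from the value. Within a run of contiguous values that difference stays constant, and it changes at every gap, so grouping on the difference yields the ranges.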

CREATE TABLE #TMP (ID INT)

INSERT INTO #TMP 
SELECT 1 UNION 
SELECT 2 UNION 
SELECT 3 UNION 
SELECT 4 UNION 
SELECT 6 UNION 
SELECT 8 UNION 
SELECT 9 UNION 
SELECT 10 UNION 
SELECT 15 UNION 
SELECT 16 UNION 
SELECT 17 UNION 
SELECT 18 UNION 
SELECT 19 UNION 
SELECT 20;  -- terminator needed: in SQL Server the statement before a WITH must end with a semicolon



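-- Number the values in ID order.
WITH RN AS 
(
    SELECT 
        ROW_NUMBER() OVER (ORDER BY ID) AS RN, ID 
    FROM 
        #TMP
),
-- DIFF is constant within a run of contiguous IDs and changes at every gap.
SRC AS 
(
    SELECT 
        RN, 
        ID, 
        ID - RN AS DIFF 
    FROM 
        RN
)
-- Group on DIFF, which never appears in the SELECT list.
SELECT 
    MIN(ID) AS RANGE_START, 
    MAX(ID) AS RANGE_END, 
    COUNT(*) AS CNT_VALUES_IN_RANGE 
FROM 
    SRC 
GROUP BY 
    DIFF
ORDER BY 
    RANGE_START;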
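
To see why the difference works as a grouping key, here is a small sketch in plain Python that prints the intermediate (ID, RN, DIFF) rows for the same fourteen values and then groups on DIFF. The function names with_diff and ranges are invented for this illustration, and the sketch assumes the values are distinct integers, as they are in the sample.

```python
from itertools import groupby

ids = [1, 2, 3, 4, 6, 8, 9, 10, 15, 16, 17, 18, 19, 20]


def with_diff(values):
    """Return (id, rn, diff) rows; rn is the 1-based position in sorted order."""
    return [(v, rn, v - rn) for rn, v in enumerate(sorted(values), start=1)]


def ranges(values):
    """Group on diff only; the result exposes start, end and count, never diff."""
    result = []
    # For distinct sorted integers diff never decreases, so grouping
    # consecutive rows matches what GROUP BY DIFF does.
    for _, rows in groupby(with_diff(values), key=lambda row: row[2]):
        members = [row[0] for row in rows]
        result.append((min(members), max(members), len(members)))
    return result


for row in with_diff(ids):
    print(row)        # e.g. (6, 5, 1) and (8, 6, 2): diff changes at each gap
print(ranges(ids))    # [(1, 4, 4), (6, 6, 1), (8, 10, 3), (15, 20, 6)]
```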

Related Solutions

Mysql – What corner cases exist when relying on undocumented behaviour to determine values selected by MySQL for hidden columns in GROUP BY operations

I was thinking about the NATURAL JOIN example you just used:

SELECT * FROM my_table NATURAL JOIN (
  SELECT   group_col, MAX(sort_col) sort_col
  FROM     my_table
  GROUP BY group_col
) t

If you shift to another type of JOIN and impose a WHERE clause, the ordering can come and go without warning, which is why relying on undocumented behavior of the GROUP BY is ill-advised.

For this example, I will:

- use Windows 7
- use MySQL 5.5.12-log for Windows
- create some sample data
- impose a LEFT JOIN without a WHERE clause
- impose a LEFT JOIN with a WHERE clause

For the DB Environment

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.12-log |
+------------+
1 row in set (0.00 sec)

mysql> show variables like '%version_co%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86                          |
| version_compile_os      | Win64                        |
+-------------------------+------------------------------+
3 rows in set (0.00 sec)

mysql>

I used this script to generate sample data

DROP DATABASE IF EXISTS eggyal;
CREATE DATABASE eggyal;
USE eggyal
CREATE TABLE groupby
(
    id int not null auto_increment,
    num int,
    primary key (id)
);
INSERT INTO groupby (num) VALUES
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
INSERT INTO groupby (num) SELECT num FROM groupby;
SELECT * FROM groupby;

and these two queries to test how the GROUP BY results behave when they are used downstream:

SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id);
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;

Let's test the durability of the GROUP BY's results:

STEP 01 : Create the Sample Data

mysql> DROP DATABASE IF EXISTS eggyal;
Query OK, 1 row affected (0.09 sec)

mysql> CREATE DATABASE eggyal;
Query OK, 1 row affected (0.00 sec)

mysql> USE eggyal
Database changed
mysql> CREATE TABLE groupby
    -> (
    ->     id int not null auto_increment,
    ->     num int,
    ->     primary key (id)
    -> );
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO groupby (num) VALUES
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
Query OK, 8 rows affected (0.06 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> INSERT INTO groupby (num) SELECT num FROM groupby;
Query OK, 8 rows affected (0.05 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM groupby;
+----+------------+
| id | num        |
+----+------------+
|  1 |  269529129 |
|  2 |  387090406 |
|  3 | 1126864683 |
|  4 |  411160755 |
|  5 |   29173595 |
|  6 |  266349579 |
|  7 | 1244227156 |
|  8 |    6231766 |
|  9 |  269529129 |
| 10 |  387090406 |
| 11 | 1126864683 |
| 12 |  411160755 |
| 13 |   29173595 |
| 14 |  266349579 |
| 15 | 1244227156 |
| 16 |    6231766 |
+----+------------+
16 rows in set (0.00 sec)

STEP 02 : Use `LEFT JOIN` without a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id);
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
|  1 |  269529129 |       NULL |
|  2 |  387090406 |       NULL |
|  3 | 1126864683 |       NULL |
|  4 |  411160755 |       NULL |
|  5 |   29173595 |       NULL |
|  6 |  266349579 |       NULL |
|  7 | 1244227156 |       NULL |
|  8 |    6231766 |       NULL |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 11 | 1126864683 | 1126864683 |
| 12 |  411160755 |  411160755 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
| 15 | 1244227156 | 1244227156 |
| 16 |    6231766 |    6231766 |
+----+------------+------------+
16 rows in set (0.00 sec)

mysql>

STEP 03 : Use `LEFT JOIN` with a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id) WHERE B.num IS NOT NULL;
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
| 16 |    6231766 |    6231766 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 12 |  411160755 |  411160755 |
| 11 | 1126864683 | 1126864683 |
| 15 | 1244227156 | 1244227156 |
+----+------------+------------+
8 rows in set (0.00 sec)

mysql>

ANALYSIS

Looking at the results above, here are two questions:

Why does a LEFT JOIN keep an ordering by id?
Why in the world did using a WHERE impose a reordering?
- Was it during the JOIN phase?
- Did the Query Optimizer look ahead at the ordering of the subquery or ignore it?

No one would have foreseen any of these effects, because the outcome of the explicit clauses depended on the implicit behavior of the Query Optimizer.

CONCLUSION

From my perspective, corner cases can only be of an external nature. In light of this, developers must be willing to fully evaluate the results of a GROUP BY in conjunction with the following twelve (12) aspects:

1. aggregate functions
2. subquery usage
3. JOIN clauses
4. WHERE clauses
5. sort order of results with no explicit ORDER BY clause
6. query results using older GA releases of MySQL
7. query results using newer beta releases of MySQL
8. the current SQL_MODE setting in my.cnf
9. the operating system the code was compiled for
10. possibly the size of join_buffer_size with respect to its effect on the Query Optimizer
11. possibly the size of sort_buffer_size with respect to its effect on the Query Optimizer
12. possibly the storage engine being used (MyISAM vs InnoDB)

Here is the key thing to remember: any instance of MySQL that works for your query in a specific environment is itself a corner case. Once you change one or more of the twelve (12) evaluation aspects, the corner case is liable to break, especially when the change touches one of the first nine (9) aspects.
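
One way to take sort order off that list is to state it explicitly. The following is a hedged sketch rather than a reproduction of the MySQL session above: it uses Python's sqlite3 module as a stand-in engine, reuses the eight num values from the sample output, and the function name latest_rows_per_num is invented here. The point is only that the row order is fixed by the ORDER BY, not by whichever join strategy the optimizer happens to pick.

```python
import sqlite3

# The eight num values shown in the sample output; each is inserted twice.
NUMS = [269529129, 387090406, 1126864683, 411160755,
        29173595, 266349579, 1244227156, 6231766]


def latest_rows_per_num():
    """Return the row with the highest id for each num, ordered by id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE groupby (id INTEGER PRIMARY KEY, num INTEGER)")
    conn.executemany("INSERT INTO groupby (num) VALUES (?)",
                     [(n,) for n in NUMS + NUMS])
    rows = conn.execute(
        """
        SELECT A.id, A.num
        FROM groupby A
        JOIN (SELECT num, MAX(id) AS id
              FROM groupby
              GROUP BY num) B
          ON A.id = B.id
        ORDER BY A.id          -- explicit, so the order no longer depends on the plan
        """
    ).fetchall()
    conn.close()
    return rows


for row in latest_rows_per_num():
    print(row)   # ids 9 through 16, in that order
```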

Postgresql – Error “column does not exist” in a SELECT with JOIN and GROUP BY query

You probably want something like this:

SELECT h_v_charges.*, 
       last_v.last_version
FROM hist_versions_charges h_v_charges 
  JOIN (select proj_charge_id, 
               max(version) as last_version
        from hist_versions_charges 
        where proj_sous_projet_id = 2  
        group by proj_charge_id
  ) last_v  
  ON h_v_charges.version = last_v.last_version
 AND h_v_charges.proj_charge_id = last_v.proj_charge_id 
ORDER BY h_v_charges.proj_charge_id ASC;

A possibly faster solution, because no join is required, would be:

select *
from (
   select hvc.*, 
          row_number() over (partition by proj_charge_id order by version desc) as rn
   from hist_versions_charges as hvc
   where proj_sous_projet_id = 2  
) as hv
where rn = 1
order by hv.proj_charge_id ASC;

As Colin has pointed out, this can also be written as:

with hv as (
  select hvc.*, 
         row_number() over (partition by proj_charge_id order by version desc) as rn
  from hist_versions_charges as hvc
  where proj_sous_projet_id = 2  
) 
select *
from hv
where rn = 1
order by hv.proj_charge_id ASC;
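
As a quick check that the join form and the row_number() form select the same rows, here is a minimal sketch using Python's sqlite3 module instead of PostgreSQL. The six sample rows and the two function names are hypothetical, only the three columns used by the queries are modelled, and the window function assumes an SQLite build of 3.25 or later.

```python
import sqlite3

# Hypothetical rows: (proj_charge_id, version, proj_sous_projet_id)
ROWS = [
    (1, 1, 2), (1, 2, 2), (1, 3, 2),
    (2, 1, 2), (2, 2, 2),
    (3, 5, 7),                      # belongs to a different sub-project
]


def make_table():
    """Create an in-memory hist_versions_charges table with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hist_versions_charges "
        "(proj_charge_id INTEGER, version INTEGER, proj_sous_projet_id INTEGER)"
    )
    conn.executemany("INSERT INTO hist_versions_charges VALUES (?, ?, ?)", ROWS)
    return conn


def latest_by_join(conn):
    """Latest version per charge via a join to a GROUP BY subquery."""
    return conn.execute(
        """
        SELECT h.proj_charge_id, h.version
        FROM hist_versions_charges h
        JOIN (SELECT proj_charge_id, MAX(version) AS last_version
              FROM hist_versions_charges
              WHERE proj_sous_projet_id = 2
              GROUP BY proj_charge_id) last_v
          ON h.version = last_v.last_version
         AND h.proj_charge_id = last_v.proj_charge_id
        ORDER BY h.proj_charge_id
        """
    ).fetchall()


def latest_by_row_number(conn):
    """Latest version per charge via row_number(), with no join."""
    return conn.execute(
        """
        SELECT proj_charge_id, version
        FROM (SELECT hvc.*,
                     row_number() OVER (PARTITION BY proj_charge_id
                                        ORDER BY version DESC) AS rn
              FROM hist_versions_charges AS hvc
              WHERE proj_sous_projet_id = 2) AS hv
        WHERE rn = 1
        ORDER BY proj_charge_id
        """
    ).fetchall()


conn = make_table()
print(latest_by_join(conn))        # [(1, 3), (2, 2)]
print(latest_by_row_number(conn))  # [(1, 3), (2, 2)]
```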