MySQL – What corner cases exist when relying on undocumented behaviour to determine values selected by MySQL for hidden columns in GROUP BY operations


Under "GROUP BY and HAVING with Hidden Columns", the MySQL manual documents (added emphasis):

MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.

Despite this explicit warning from the developers, some people continue to rely on undocumented behaviour to determine the value that will be selected from a hidden column.

In particular, MySQL often appears to select the "first" record from each group (where the notion of "first" is itself undocumented, such as the oldest record on some storage engines or according to some sort order applied to a materialised table from a subquery). I've seen this exploited to retrieve, for example, a groupwise maximum:

SELECT * FROM (
  SELECT * FROM my_table ORDER BY sort_col DESC
) t GROUP BY group_col

For completeness, the same can be accomplished in a standard and documented fashion with a simple join:

SELECT * FROM my_table NATURAL JOIN (
  SELECT   group_col, MAX(sort_col) sort_col
  FROM     my_table
  GROUP BY group_col
) t

I believe that one should never rely on undocumented behaviour because there may be unforeseen corner cases that cause that behaviour to break. For example, in satisfying a GROUP BY operation with an index, MySQL sorts the results and may thereby choose an unexpected value.

What other corner cases can break this behaviour? Or is it sufficiently reliable for production systems?

Best Answer

Consider the NATURAL JOIN example you just used:

SELECT * FROM my_table NATURAL JOIN (
  SELECT   group_col, MAX(sort_col) sort_col
  FROM     my_table
  GROUP BY group_col
) t

If you switch to another type of JOIN and add a WHERE clause, the ordering of the results can come and go without warning, which is exactly why relying on the undocumented behavior of GROUP BY is ill-advised.

For this example, I will

use Windows 7
use MySQL 5.5.12-log for Windows
create some sample data
impose a LEFT JOIN without a WHERE clause
impose a LEFT JOIN with a WHERE clause

For the DB Environment

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.12-log |
+------------+
1 row in set (0.00 sec)

mysql> show variables like '%version_co%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86                          |
| version_compile_os      | Win64                        |
+-------------------------+------------------------------+
3 rows in set (0.00 sec)

mysql>

I used this script to generate sample data:

DROP DATABASE IF EXISTS eggyal;
CREATE DATABASE eggyal;
USE eggyal
CREATE TABLE groupby
(
    id int not null auto_increment,
    num int,
    primary key (id)
);
INSERT INTO groupby (num) VALUES
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
INSERT INTO groupby (num) SELECT num FROM groupby;
SELECT * FROM groupby;

and these two queries to test how the GROUP BY results behave when they are used in a join:

SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id);
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;

Let's test the durability of the GROUP BY's results:

STEP 01 : Create the Sample Data

mysql> DROP DATABASE IF EXISTS eggyal;
Query OK, 1 row affected (0.09 sec)

mysql> CREATE DATABASE eggyal;
Query OK, 1 row affected (0.00 sec)

mysql> USE eggyal
Database changed
mysql> CREATE TABLE groupby
    -> (
    ->     id int not null auto_increment,
    ->     num int,
    ->     primary key (id)
    -> );
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO groupby (num) VALUES
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
    -> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
Query OK, 8 rows affected (0.06 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> INSERT INTO groupby (num) SELECT num FROM groupby;
Query OK, 8 rows affected (0.05 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM groupby;
+----+------------+
| id | num        |
+----+------------+
|  1 |  269529129 |
|  2 |  387090406 |
|  3 | 1126864683 |
|  4 |  411160755 |
|  5 |   29173595 |
|  6 |  266349579 |
|  7 | 1244227156 |
|  8 |    6231766 |
|  9 |  269529129 |
| 10 |  387090406 |
| 11 | 1126864683 |
| 12 |  411160755 |
| 13 |   29173595 |
| 14 |  266349579 |
| 15 | 1244227156 |
| 16 |    6231766 |
+----+------------+
16 rows in set (0.00 sec)

STEP 02 : Use `LEFT JOIN` without a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id);
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
|  1 |  269529129 |       NULL |
|  2 |  387090406 |       NULL |
|  3 | 1126864683 |       NULL |
|  4 |  411160755 |       NULL |
|  5 |   29173595 |       NULL |
|  6 |  266349579 |       NULL |
|  7 | 1244227156 |       NULL |
|  8 |    6231766 |       NULL |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 11 | 1126864683 | 1126864683 |
| 12 |  411160755 |  411160755 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
| 15 | 1244227156 | 1244227156 |
| 16 |    6231766 |    6231766 |
+----+------------+------------+
16 rows in set (0.00 sec)

mysql>

STEP 03 : Use `LEFT JOIN` with a `WHERE` clause

mysql> SELECT * FROM groupby A LEFT JOIN
    -> (
    ->     SELECT   num, MAX(id) id
    ->     FROM     groupby
    ->     GROUP BY num
    -> ) B USING (id) WHERE B.num IS NOT NULL;
+----+------------+------------+
| id | num        | num        |
+----+------------+------------+
| 16 |    6231766 |    6231766 |
| 13 |   29173595 |   29173595 |
| 14 |  266349579 |  266349579 |
|  9 |  269529129 |  269529129 |
| 10 |  387090406 |  387090406 |
| 12 |  411160755 |  411160755 |
| 11 | 1126864683 | 1126864683 |
| 15 | 1244227156 | 1244227156 |
+----+------------+------------+
8 rows in set (0.00 sec)

mysql>

ANALYSIS

Looking at the results above, here are two questions:

Why does a LEFT JOIN keep an ordering by id?
Why in the world did adding a WHERE clause impose a reordering?
- Was it during the JOIN phase?
- Did the Query Optimizer look ahead at the ordering of the subquery or ignore it?

No one foresaw either of these effects, because the ordering produced by the explicit clauses depends on the implicit behavior of the Query Optimizer.
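Whatever the optimizer does internally, the only documented way to get a particular row order is to ask for it. The sketch below reuses the STEP 03 query against the same `groupby` table: EXPLAIN shows which table the optimizer reads first, and an explicit ORDER BY pins the order. One plausible reading of STEP 03 (an assumption, not verified on this server) is that the NULL-rejecting WHERE lets the optimizer treat the LEFT JOIN as an inner join and drive it from B, whose GROUP BY output happens to come back sorted by num.

```sql
-- Inspect the join order the optimizer chooses for the STEP 03 query.
EXPLAIN
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;

-- Same query with the order stated explicitly, so the result no longer
-- depends on the join order or on how the GROUP BY was executed.
SELECT * FROM groupby A LEFT JOIN
(
    SELECT   num, MAX(id) id
    FROM     groupby
    GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL
ORDER BY id;
```

With the ORDER BY in place, the rows come back in ascending id regardless of which plan is chosen.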

CONCLUSION

From my perspective, corner cases can only be of an external nature. In light of this, developers must be willing to fully evaluate the results of a GROUP BY in conjunction with the following twelve (12) aspects:

aggregate functions
subquery usage
JOIN clauses
WHERE clauses
sort order of results with no explicit ORDER BY clause
query results using older GA releases of MySQL
query results using newer beta releases of MySQL
the current SQL_MODE setting in my.cnf
the operating system the code was compiled for
possibly the size of join_buffer_size with respect to its effect on the Query Optimizer
possibly the size of sort_buffer_size with respect to its effect on the Query Optimizer
possibly the storage engine being used (MyISAM vs InnoDB)

Here is the key thing to remember: any instance of MySQL that works for your query in a specific environment is itself a corner case. Once you change one or more of the twelve (12) evaluation aspects, the corner case is liable to break, especially if the change involves one of the first nine (9) aspects.
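Of the aspects listed, SQL_MODE is the one that bears most directly on hidden columns: with ONLY_FULL_GROUP_BY enabled, MySQL refuses such queries outright instead of picking a value. A minimal sketch against the `groupby` table from above (the SET statement here replaces the whole session mode, which is a simplification for illustration):

```sql
-- Caution: this replaces the session's sql_mode, dropping any other
-- mode flags for the current session only.
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';

-- id is neither aggregated nor named in GROUP BY, so this statement is
-- expected to be rejected (error 1055) rather than return an arbitrary id.
SELECT id, num FROM groupby GROUP BY num;

-- Accepted: every selected column is either grouped or aggregated.
SELECT num, MAX(id) AS id FROM groupby GROUP BY num;
```

Running a test suite with this mode enabled is one way to find every query that silently depends on a hidden column.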

Related Solutions

MySQL – Is it possible to reduce a group to a row based on some criteria

The traditional solution, the one you may find in books, is to do a self join: first find that "max date per group" of yours, then join to self table on rows with said max date.
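As a sketch of that traditional self join, using the same placeholder names as the query further down (`t`, `whatever`, `date_added`); the alias `latest` is only illustrative:

```sql
SELECT t.*
FROM t
JOIN (
    -- One row per group, carrying that group's latest date.
    SELECT   whatever, MAX(date_added) AS date_added
    FROM     t
    GROUP BY whatever
) latest
  ON  latest.whatever   = t.whatever
  AND latest.date_added = t.date_added;
```

If two rows in a group share the same maximum date_added, both are returned, so a further tie-breaker is needed when exactly one row per group is required.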

However, some hacks allow you to avoid that. Consider the following query:

SELECT
  MAX(date_added) AS date_added,
  SUBSTRING_INDEX(GROUP_CONCAT(some_column ORDER BY date_added DESC), ',', 1) AS some_column,
  SUBSTRING_INDEX(GROUP_CONCAT(another_column ORDER BY date_added DESC), ',', 1) AS another_column
FROM t
GROUP BY whatever

GROUP_CONCAT is an aggregate function which implodes values into one string. It allows for ORDER BY, which we utilize via ORDER BY date_added DESC so as to put our desired value first. We then slice off the first token in the string via SUBSTRING_INDEX.

The downside here (apart from giving the query quite a frightening appearance) is that your numerical values are transformed into text. Typically no big deal with SQL, but please be aware.
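If the text conversion matters, the value can be cast back. The sketch below assumes some_column is an integer column (an assumption; the original does not state its type) and shows the session variable that limits how long a GROUP_CONCAT result may grow before it is truncated:

```sql
-- GROUP_CONCAT output is cut off at group_concat_max_len bytes
-- (1024 by default); check or raise it for the current session.
SHOW VARIABLES LIKE 'group_concat_max_len';
SET SESSION group_concat_max_len = 1000000;

SELECT
  whatever,
  MAX(date_added) AS date_added,
  -- Take the first token, then convert it back to an integer.
  CAST(
    SUBSTRING_INDEX(GROUP_CONCAT(some_column ORDER BY date_added DESC), ',', 1)
    AS SIGNED
  ) AS some_column
FROM t
GROUP BY whatever;
```

For text columns the same trick breaks whenever a value contains the separator itself, so a separator that cannot occur in the data would have to be chosen via GROUP_CONCAT's SEPARATOR clause.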

There's another option where you do a semi-self-join, a much lighter one; you will have to give up usage of index. It's quite long to describe; it still uses GROUP_CONCAT and SUBSTRING_INDEX, but only for the purpose of creating a derived table with only relevant keys. This derived table is then joined with original table. See an example in SQL: selecting top N records per group, another solution.

MySQL – UPDATE/INSERT into Dynamic Columns

Unfortunately, Dynamic SQL is not permitted in Triggers. Notwithstanding, what you are asking for is still possible, but you will have to apply some elbow grease.

I wrote an answer to the post Disable trigger for just one table. What I did was create a user variable @TRIGGER_DISABLED as a flag to control if the code within the trigger was to be executed.

In your particular case, you will need 9 flags. You will also need a BEFORE UPDATE trigger:

DELIMITER $$
CREATE TRIGGER `setup_record_time` BEFORE UPDATE ON `my_table`
FOR EACH ROW
BEGIN
  -- Set one flag per column: 1 if the value changed, 0 otherwise.
  -- <=> is the NULL-safe equality operator, so NULL-to-NULL is not flagged as a change.
  SET @rec_col1 = IF(NEW.col1 <=> OLD.col1,0,1);
  SET @rec_col2 = IF(NEW.col2 <=> OLD.col2,0,1);
  SET @rec_col3 = IF(NEW.col3 <=> OLD.col3,0,1);
  SET @rec_col4 = IF(NEW.col4 <=> OLD.col4,0,1);
  SET @rec_col5 = IF(NEW.col5 <=> OLD.col5,0,1);
  SET @rec_col6 = IF(NEW.col6 <=> OLD.col6,0,1);
  SET @rec_col7 = IF(NEW.col7 <=> OLD.col7,0,1);
  SET @rec_col8 = IF(NEW.col8 <=> OLD.col8,0,1);
  SET @rec_col9 = IF(NEW.col9 <=> OLD.col9,0,1);
END $$
DELIMITER ;

The AFTER UPDATE trigger looks like this:

DELIMITER $$
CREATE TRIGGER `record_time` AFTER UPDATE ON `my_table`
FOR EACH ROW
BEGIN
  -- For each column flagged by the BEFORE UPDATE trigger, record today's date.
  IF @rec_col1 = 1 THEN INSERT INTO `record_changes` (col1) values (CURDATE()); END IF;
  IF @rec_col2 = 1 THEN INSERT INTO `record_changes` (col2) values (CURDATE()); END IF;
  IF @rec_col3 = 1 THEN INSERT INTO `record_changes` (col3) values (CURDATE()); END IF;
  IF @rec_col4 = 1 THEN INSERT INTO `record_changes` (col4) values (CURDATE()); END IF;
  IF @rec_col5 = 1 THEN INSERT INTO `record_changes` (col5) values (CURDATE()); END IF;
  IF @rec_col6 = 1 THEN INSERT INTO `record_changes` (col6) values (CURDATE()); END IF;
  IF @rec_col7 = 1 THEN INSERT INTO `record_changes` (col7) values (CURDATE()); END IF;
  IF @rec_col8 = 1 THEN INSERT INTO `record_changes` (col8) values (CURDATE()); END IF;
  IF @rec_col9 = 1 THEN INSERT INTO `record_changes` (col9) values (CURDATE()); END IF;
END $$
DELIMITER ;

Just to be on the safe side, you should initialize these 9 user variables with 0.

SET @rec_col1 = 0;
SET @rec_col2 = 0;
SET @rec_col3 = 0;
SET @rec_col4 = 0;
SET @rec_col5 = 0;
SET @rec_col6 = 0;
SET @rec_col7 = 0;
SET @rec_col8 = 0;
SET @rec_col9 = 0;
UPDATE my_table ... ;

That way, your session dictates the trigger's behavior.

Give it a Try !!!