I was thinking the NATURAL JOIN
example you just used
SELECT * FROM my_table NATURAL JOIN (
SELECT group_col, MAX(sort_col) sort_col
FROM my_table
GROUP BY group_col
) t
If you shift to another type of JOIN
and impose WHERE
, ordering can come and go without warning in spite of the ill-advised reliance on undocumented behavior of the GROUP BY
.
For this example, I will
- use Windows 7
- use MySQL 5.5.12-log for Windows
- create some sample data
- impose a
LEFT JOIN
without a WHERE
clause
- impose a
LEFT JOIN
with a WHERE clause
For the DB Environment
mysql> select version();
+------------+
| version() |
+------------+
| 5.5.12-log |
+------------+
1 row in set (0.00 sec)
mysql> show variables like '%version_co%';
+-------------------------+------------------------------+
| Variable_name | Value |
+-------------------------+------------------------------+
| version_comment | MySQL Community Server (GPL) |
| version_compile_machine | x86 |
| version_compile_os | Win64 |
+-------------------------+------------------------------+
3 rows in set (0.00 sec)
mysql>
Using this script to generate sample data
DROP DATABASE IF EXISTS eggyal;
CREATE DATABASE eggyal;
USE eggyal
CREATE TABLE groupby
(
id int not null auto_increment,
num int,
primary key (id)
);
INSERT INTO groupby (num) VALUES
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
(floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
INSERT INTO groupby (num) SELECT num FROM groupby;
SELECT * FROM groupby;
and these two queries for testing the GROUP BY
subsequent use;
SELECT * FROM groupby A LEFT JOIN
(
SELECT num, MAX(id) id
FROM groupby
GROUP BY num
) B USING (id);
SELECT * FROM groupby A LEFT JOIN
(
SELECT num, MAX(id) id
FROM groupby
GROUP BY num
) B USING (id) WHERE B.num IS NOT NULL;
Let's test the durability of the GROUP BY
's results;
STEP 01 : Create the Sample Data
mysql> DROP DATABASE IF EXISTS eggyal;
Query OK, 1 row affected (0.09 sec)
mysql> CREATE DATABASE eggyal;
Query OK, 1 row affected (0.00 sec)
mysql> USE eggyal
Database changed
mysql> CREATE TABLE groupby
-> (
-> id int not null auto_increment,
-> num int,
-> primary key (id)
-> );
Query OK, 0 rows affected (0.07 sec)
mysql> INSERT INTO groupby (num) VALUES
-> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
-> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
-> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp())),
-> (floor(rand() * unix_timestamp())),(floor(rand() * unix_timestamp()));
Query OK, 8 rows affected (0.06 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> INSERT INTO groupby (num) SELECT num FROM groupby;
Query OK, 8 rows affected (0.05 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM groupby;
+----+------------+
| id | num |
+----+------------+
| 1 | 269529129 |
| 2 | 387090406 |
| 3 | 1126864683 |
| 4 | 411160755 |
| 5 | 29173595 |
| 6 | 266349579 |
| 7 | 1244227156 |
| 8 | 6231766 |
| 9 | 269529129 |
| 10 | 387090406 |
| 11 | 1126864683 |
| 12 | 411160755 |
| 13 | 29173595 |
| 14 | 266349579 |
| 15 | 1244227156 |
| 16 | 6231766 |
+----+------------+
16 rows in set (0.00 sec)
STEP 02 : Use LEFT JOIN
without a WHERE
clause
mysql> SELECT * FROM groupby A LEFT JOIN
-> (
-> SELECT num, MAX(id) id
-> FROM groupby
-> GROUP BY num
-> ) B USING (id);
+----+------------+------------+
| id | num | num |
+----+------------+------------+
| 1 | 269529129 | NULL |
| 2 | 387090406 | NULL |
| 3 | 1126864683 | NULL |
| 4 | 411160755 | NULL |
| 5 | 29173595 | NULL |
| 6 | 266349579 | NULL |
| 7 | 1244227156 | NULL |
| 8 | 6231766 | NULL |
| 9 | 269529129 | 269529129 |
| 10 | 387090406 | 387090406 |
| 11 | 1126864683 | 1126864683 |
| 12 | 411160755 | 411160755 |
| 13 | 29173595 | 29173595 |
| 14 | 266349579 | 266349579 |
| 15 | 1244227156 | 1244227156 |
| 16 | 6231766 | 6231766 |
+----+------------+------------+
16 rows in set (0.00 sec)
mysql>
STEP 03 : Use LEFT JOIN
with a WHERE
clause
mysql> SELECT * FROM groupby A LEFT JOIN
-> (
-> SELECT num, MAX(id) id
-> FROM groupby
-> GROUP BY num
-> ) B USING (id) WHERE B.num IS NOT NULL;
+----+------------+------------+
| id | num | num |
+----+------------+------------+
| 16 | 6231766 | 6231766 |
| 13 | 29173595 | 29173595 |
| 14 | 266349579 | 266349579 |
| 9 | 269529129 | 269529129 |
| 10 | 387090406 | 387090406 |
| 12 | 411160755 | 411160755 |
| 11 | 1126864683 | 1126864683 |
| 15 | 1244227156 | 1244227156 |
+----+------------+------------+
8 rows in set (0.00 sec)
mysql>
ANALYSIS
Looking at the aforementioned results, here are two questions:
- Why does a
LEFT JOIN
keep an ordering by id
?
- Why in the world did using a
WHERE
impose a reordering ?
- Was it during the JOIN phase ?
- Did the Query Optimizer look ahead at the ordering of the subquery or ignore it ?
No one foresaw any of these effects because the behavior of explicit clauses was relied upon by the implicit behavior of the Query Optimizer.
CONCLUSION
From my perspective, corner cases can only be of an external nature. In light of this, developers must be willing to fully evaluate the results of a GROUP BY
in conjunction with the following twelve(12) aspects:
- aggregate functions
- subquery usage
JOINs
clauses
WHERE
clauses
- sort order of results with no explicit
ORDER BY
clause
- query results using older GA releases of MySQL
- query results using newer beta releases of MySQL
- the current SQL_MODE setting in
my.cnf
- the operating system the code was compiled for
- possibly the size of join_buffer_size with respect to its effect on the Query Optimizer
- possibly the size of sort_buffer_size with respect to its effect on the Query Optimizer
- possibly the storage engine being used (MyISAM vs InnoDB)
Here is the key thing to remember : Any instance of MySQL that works for your query in a specific environment is itself a corner case. Once you change one or more of the twelve(12) evaluation aspects, the corner case is due to break, especially given the first nine(9) aspects.
First things first, I notice that your 'what I do now' query:
SELECT TOP (1)
ca.SensorValue,
ca.Date
FROM sys.partitions AS p
CROSS APPLY
(
SELECT TOP (1)
v.Date,
v.SensorValue
FROM SensorValues AS v
WHERE
$PARTITION.SensorValues_Date_PF(v.Date) = p.[partition_number]
AND v.DeviceId = @fDeviceId
AND v.SensorId = @fSensorId
AND v.Date <= @fDate
ORDER BY
v.Date DESC
) AS ca
WHERE
p.[partition_number] <= $PARTITION.SensorValues_Date_PF(@fDate)
AND p.[object_id] = OBJECT_ID(N'dbo.SensorValues', N'U')
AND p.index_id = 1
ORDER BY
p.[partition_number] DESC,
ca.Date DESC;
...produces an execution plan like this:
This execution plan has an estimated total cost of 0.02 units. Over 50% of this estimated cost is the final Sort, running in Top-N mode. Now estimates are just that, but sorts can be expensive in general, so let's remove it without changing the semantics:
SELECT TOP (1)
ca.SensorId,
ca.SensorValue,
ca.Date
FROM
(
-- Partition numbers
SELECT DISTINCT
partition_number = prv.boundary_id
FROM
sys.partition_functions AS pf
JOIN sys.partition_range_values AS prv ON
prv.function_id = pf.function_id
WHERE
pf.name = N'SensorValues_Date_PF'
AND prv.boundary_id <= $PARTITION.SensorValues_Date_PF(@fDate)
) AS p
CROSS APPLY
(
SELECT TOP (1)
v.Date,
v.SensorValue,
v.SensorId
FROM dbo.SensorValues AS v
WHERE
$PARTITION.SensorValues_Date_PF(v.Date) = p.partition_number
AND v.DeviceId = @fDeviceId
AND v.SensorId = @fSensorId
AND v.Date <= @fDate
ORDER BY
v.Date DESC
) AS ca
ORDER BY
p.partition_number DESC,
ca.Date DESC
Now the execution plan has no blocking operators, and no sorts in particular. The estimated cost of the new query plan below is 0.01 units and the total cost is distributed evenly over the data access methods:
With the improvement in place, all we need to produce a result for each Sensor ID is to make a list of Sensor IDs and APPLY
the previous code to each one:
SELECT
PerSensor.SensorId,
PerSensor.SensorValue,
PerSensor.Date
FROM
(
-- Sensor ID list
VALUES
(@fSensorId1),
(@FSensorId2),
(@FSensorId3)
) AS Sensor (Id)
CROSS APPLY
(
-- Optimized code applied to each sensor
SELECT TOP (1)
ca.SensorId,
ca.SensorValue,
ca.Date
FROM
(
-- Partition numbers
SELECT DISTINCT
partition_number = prv.boundary_id
FROM
sys.partition_functions AS pf
JOIN sys.partition_range_values AS prv ON
prv.function_id = pf.function_id
WHERE
pf.name = N'SensorValues_Date_PF'
AND prv.boundary_id <= $PARTITION.SensorValues_Date_PF(@fDate)
) AS p
CROSS APPLY
(
SELECT TOP (1)
v.Date,
v.SensorValue,
v.SensorId
FROM dbo.SensorValues AS v
WHERE
$PARTITION.SensorValues_Date_PF(v.Date) = p.partition_number
AND v.DeviceId = @fDeviceId
AND v.SensorId = Sensor.Id--@fSensorId1
AND v.Date <= @fDate
ORDER BY
v.Date DESC
) AS ca
ORDER BY
p.partition_number DESC,
ca.Date DESC
) AS PerSensor;
The query plan is:
Estimated query plan cost for three Sensor IDs is 0.011 - half that of the original single-sensor plan.
Best Answer
This is an anti-semijoin. Can be written with
NOT IN
,LEFT JOIN / WHERE IS NULL
orNOT EXISTS
:You could also use
GROUP BY
with aCASE
expression: