Mysql – Full GROUP BY in MySQL 5.7 breaks de-dupe queries

MySQL

Prior to MySQL 5.7, I was able to dedupe queries by using GROUP BY as long as the query results were done in such a way that the duplicates had the same data and were already in the sort order I wanted.

I got really good at that.

Now, with 5.7 and only_full_group_by mode, those queries are invalid. I've already been through the 5 stages of grief over it, and am working to update / improve my queries to be more standard.

the problem is: when I do a LEFT JOIN (in some cases) it produces rows and rows of (essentially) duplicate information. I get that they are not technically duplicates because some column may be different. And, I see a lot of questions about this in a general way. One such use case is "find the highest value for X in the day." You used to just use ORDER BY x DESC with a GROUP BY and poof you had it. Now, you have to do LEFT JOINs and look for NULLs. (Part of me still wonders "why is this better?" but that's off topic).

….but I want to know what best practice replaced using GROUP BY to remove these duplicates? SELECT DISTINCT?

Best Answer

SELECT DISTINCT dedups on all columns. Sounds like that is what you want. GROUP BY has other purposes.

But I smell another problem here... A common pattern is

SELECT
    ... some columns, some aggregates
    FROM a
    JOIN b  ON ...
    WHERE ...
    GROUP BY some columns

This is rather inefficient (even when you don't run into the "only group by" restriction). First the JOIN is done. This inflates the number of rows, and builds a big tmp table. Then the GROUP BY is used to deflate it, often back to exactly the number of rows that are in one of the tables.

There are a couple of workarounds. But the depend on the details of the query (please show us the query).

If all the aggregates are on columns in, say, b, then use a derived table to first compute all the aggregates from b, then join the result of that to a to get the rest of the columns.

If there is only one aggregate, and even if the JOIN is actually LEFT JOIN, then

SELECT some columns,
       ( SELECT ... FROM b WHERE ... GROUP BY ...) -- correlated subquery
    FROM a
    -- no GROUP BY needed here, so no inflate-deflate inefficiency