Mysql – Full GROUP BY in MySQL 5.7 breaks de-dupe queries

MySQL

Prior to MySQL 5.7, I was able to dedupe queries by using GROUP BY as long as the query results were done in such a way that the duplicates had the same data and were already in the sort order I wanted.

I got really good at that.

Now, with 5.7 and only_full_group_by mode, those queries are invalid. I've already been through the 5 stages of grief over it, and am working to update / improve my queries to be more standard.

the problem is: when I do a LEFT JOIN (in some cases) it produces rows and rows of (essentially) duplicate information. I get that they are not technically duplicates because some column may be different. And, I see a lot of questions about this in a general way. One such use case is "find the highest value for X in the day." You used to just use ORDER BY x DESC with a GROUP BY and poof you had it. Now, you have to do LEFT JOINs and look for NULLs. (Part of me still wonders "why is this better?" but that's off topic).

….but I want to know what best practice replaced using GROUP BY to remove these duplicates? SELECT DISTINCT?

Best Answer

SELECT DISTINCT dedups on all columns. Sounds like that is what you want. GROUP BY has other purposes.

But I smell another problem here... A common pattern is

SELECT
    ... some columns, some aggregates
    FROM a
    JOIN b  ON ...
    WHERE ...
    GROUP BY some columns

This is rather inefficient (even when you don't run into the "only group by" restriction). First the JOIN is done. This inflates the number of rows, and builds a big tmp table. Then the GROUP BY is used to deflate it, often back to exactly the number of rows that are in one of the tables.

There are a couple of workarounds. But the depend on the details of the query (please show us the query).

If all the aggregates are on columns in, say, b, then use a derived table to first compute all the aggregates from b, then join the result of that to a to get the rest of the columns.

If there is only one aggregate, and even if the JOIN is actually LEFT JOIN, then

SELECT some columns,
       ( SELECT ... FROM b WHERE ... GROUP BY ...) -- correlated subquery
    FROM a
    -- no GROUP BY needed here, so no inflate-deflate inefficiency

Related Solutions

Mysql – Power went out – Did query finish

If you are running with binary logs enabled this can be checked with relatively high reliability.

First, to see whether binary logs are indeed enabled, run:

SHOW BINARY LOGS;

If they're enabled, you should get an output like this:

+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000244 |  15462544 |
| mysql-bin.000245 | 102622775 |
+------------------+-----------+

Otherwise you would get an error message.

Now, if binary logs are enabled, then any successful commit is also written to binary logs. I say "commit", but the truth is any successful operation, even on non transactional tables such as MyISAM, is written there. But, to be honest, to have any certainty about the outcome of your queries, I do hope for your sake you're using a transactional engine such as InnoDB, or else you cannot be sure of anything.

OK, so asssuming now that you have binary logs turned on, and that your tables are transactional (hopefully InnoDB), then a successful completion of your queries is expected to be written to the binary logs.

You must now hunt down the relevant binary log and look for the query there. If you find the query - good! If not - it's most probably not there. I'll explain shortly.

Which binary log contains your query? Look at the binary log files themselves, typically in your data directory. Look for their timestamps. When the power came up, a new binary log was created. Find it. Your queries are most probably in the binary log one before that one. This is a guess. It could also be one before that, etc. But it's a good guess.

Now, using the mysqlbinlog utility, execute, from command line, something like this:

mysqlbinlog mysql-bin.000245

Replace file name with the one you suspect to contain the query.

This will output all queries in this binary log file into standard output. On Unix, use grep to find your query:

mysqlbinlog mysql-bin.000245 | grep "something which identifies the query"

On Windows, good luck. Open with notepad++ or something and search manually.

Is the query there? Great - you know it has been committed.

Is the query not there? Need to check on sync_binlog param. Is it 1? Then query not in binary log ==> query not committed. But if sync_binlog is not 1, there could still be a chance that the query has been committed yet not in the binary log, as the crash may have happened just after the commit and just before the binary log was flushed to disk. You then need to revert to other means.

Those being: (and hopefully, again, you're using InnoDB): look for a single row that can identify outcome of the query. With InnoDB you get "all or nothing". If you can be certain of a single row affected by the query - you can be sure that the query has completed.

edit: of course, if slow log is enabled, you can expect such long query to be logged there after completion as well...

Good luck!

Mysql – group by clause without aggregate function

This:

SELECT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700')
GROUP BY users.id

Finds all users who have a timesheet submitted on or before the given date. It's equivalent to:

SELECT DISTINCT users.* FROM users
INNER JOIN timesheets ON timesheets.user_id = users.id
WHERE (timesheets.submitted_at <= '2010-07-06 15:27:05.117700');

or:

SELECT  users.*
FROM users
WHERE EXISTS (
    SELECT 1
    FROM timesheets 
    WHERE timesheets.user_id = users.id
    AND timesheets.submitted_at <= '2010-07-06 15:27:05.117700'
);

It works because users.id is the primary key, so all other fields of users are functionally dependent on it. PostgreSQL knows that you don't have to use an aggregate to guarantee a single unambiguous result for each field in a row because there can only be one candidate users.name or whatever for any given users.id row.

(Older PostgreSQL versions didn't know how to identify functional dependencies of the primary key and and would throw an ERROR about needing to use an aggregate or include the field in the GROUP BY here).

Best Answer

Related Solutions

Mysql – Power went out – Did query finish

Mysql – group by clause without aggregate function

Related Question