MySQL `where…in` statement – does it remove duplicates

MySQL, mysql-5.7

I'm curious whether MySQL's WHERE ... IN (...) clause removes duplicates as an optimization.

For example, if I use a subquery, is it important that I use DISTINCT to remove duplicates?

An example query:

SELECT * FROM foo WHERE bar_id IN (
    SELECT id FROM bar WHERE user_id = 4
);

The subquery can potentially return duplicate bar.id values.

Given MySQL's query optimization, does adding DISTINCT make this query any better?

SELECT * FROM foo WHERE bar_id IN (
    SELECT DISTINCT id FROM bar WHERE user_id = 4
);

In my tests on a database with about 2.5 million rows in the bar table (with indexes where appropriate for such a query), the response time is roughly the same for both versions. This is on a large RDS instance with plenty of headroom for now.

I should note that I'm hoping for an explanation more than "the subquery returns fewer results when using DISTINCT, so of course it's better", as that ignores MySQL's query optimizer.

For example, perhaps DISTINCT uses more resources and is therefore slower overall, especially if WHERE ... IN (...) already handles duplicates cheaply. These are the details I'm unsure about.

Best Answer

Assuming that user_id is not a PRIMARY KEY, MySQL has to de-duplicate the subquery's matches itself in order to apply the correct IN semantics. The DISTINCT keyword should therefore not be required here, since it doesn't change the semantics.
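To make the semantics concrete, here is a minimal sketch using the foo and bar tables from the question (the aliases f and b are only for illustration). IN (subquery) is a membership test, so a foo row is returned at most once however often a value repeats in the subquery. The EXISTS form should return the same rows, while a plain inner join would not if bar.id can repeat.

```sql
-- Equivalent to the original IN query: each foo row is tested for
-- membership and returned at most once.
SELECT f.*
FROM foo AS f
WHERE EXISTS (
    SELECT 1
    FROM bar AS b
    WHERE b.id = f.bar_id
      AND b.user_id = 4
);

-- NOT equivalent if bar.id can repeat: a foo row comes back once per
-- matching bar row, so duplicates would have to be removed explicitly.
SELECT f.*
FROM foo AS f
JOIN bar AS b ON b.id = f.bar_id
WHERE b.user_id = 4;
```

This is why adding DISTINCT inside the subquery does not change the result of the IN query: the de-duplication is already implied by the membership test.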

MySQL actually has more than one strategy for how to remove duplicates: https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization.html
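As a rough way to experiment with those strategies, the session-level optimizer_switch variable can be inspected and changed. The sketch below assumes a MySQL 5.7 server where the semijoin flag exists and is on by default; it only affects the current session.

```sql
-- Show which optimizer flags (semijoin, materialization, firstmatch,
-- loosescan, ...) are currently enabled for this session.
SELECT @@optimizer_switch;

-- Temporarily disable the semijoin transformation, then rerun the
-- query and compare timings and EXPLAIN output.
SET SESSION optimizer_switch = 'semijoin=off';

SELECT * FROM foo WHERE bar_id IN (
    SELECT id FROM bar WHERE user_id = 4
);

-- Restore the flag afterwards.
SET SESSION optimizer_switch = 'semijoin=on';
```

Comparing the plans with the flag on and off gives a feel for how much work the optimizer is doing on its own, with or without DISTINCT in the subquery.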

To see the strategy used, you need to look at the output from EXPLAIN FORMAT=JSON (it does not appear in the regular tabular EXPLAIN). You will see something like:

    "transformation": {
      "select#": 2,
      "from": "IN (SELECT)",
      "to": "semijoin",
      "chosen": true
    }
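For reference, these are the statements one would run against the query from the question. A "transformation" object shaped like the one above is also what the optimizer trace records, so if it is missing from your EXPLAIN FORMAT=JSON output, the trace is worth checking as well. This is a sketch assuming MySQL 5.6 or later, where the optimizer trace is available.

```sql
-- Structured plan for the query from the question.
EXPLAIN FORMAT=JSON
SELECT * FROM foo WHERE bar_id IN (
    SELECT id FROM bar WHERE user_id = 4
);

-- Optimizer trace for the same query (session-scoped).
SET SESSION optimizer_trace = 'enabled=on';

SELECT * FROM foo WHERE bar_id IN (
    SELECT id FROM bar WHERE user_id = 4
);

-- The trace of the most recently traced statement is exposed here.
SELECT trace FROM information_schema.optimizer_trace;

SET SESSION optimizer_trace = 'enabled=off';
```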

Related Solutions

MySQL – How to select the latest record having one state where no later records exist with any other state

SELECT widget, MAX(`timestamp`) AS ts
FROM tableX AS t
WHERE state = 'down'
GROUP BY widget
HAVING NOT EXISTS
       ( SELECT *
         FROM tableX AS tt
         WHERE tt.widget = t.widget
           AND tt.state <> 'down'
           AND tt.`timestamp` > MAX(t.`timestamp`)
       ) ;

I think that you'll need two indices, one on (widget, state, timestamp) and one on (widget, timestamp, state) for efficiency.
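The two indexes mentioned above could be created as follows. The index names are hypothetical and chosen only for this sketch; the column order follows the text.

```sql
-- Supports the outer WHERE state = 'down' ... GROUP BY widget part.
CREATE INDEX idx_widget_state_ts ON tableX (widget, state, `timestamp`);

-- Supports the correlated NOT EXISTS lookup of later rows per widget.
CREATE INDEX idx_widget_ts_state ON tableX (widget, `timestamp`, state);
```

Running EXPLAIN on the query before and after adding them is the simplest way to confirm whether the optimizer actually picks them up.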

This will work too, and needs only one index, on (widget, timestamp, state):

SELECT t.widget, t.`timestamp`
FROM 
        tableX AS t
    JOIN
        ( SELECT widget, MAX(`timestamp`) AS ts
          FROM tableX
          GROUP BY widget
        ) AS tm
            ON  tm.widget = t.widget
            AND tm.ts = t.`timestamp`
WHERE t.state = 'down' ;

Tested both at SQL-Fiddle.

MySQL – Why does copied MySQL database have different data than source

Honestly, I expected you to say MyISAM since I've never seen InnoDB generate a backup that seems inconsistent with what you think is in the table.

As a DBA, I don't trust my data to anything that I can't watch while it's working and peer into its intermediate stages, so I rarely use GUI tools. My approach to this issue would be to use mysqldump to extract the data into a file and then review that file manually. Satisfied with the sanity of it, I would load it onto the target server with the mysql command line client and see what the results are.

The specific option I'd suggest with mysqldump in this case would be --skip-extended-insert, which generates one insert statement for each row in each table. This is in contrast to the normal format, which combines groups of rows into a smaller number of insert statements as an optimization for faster restoring. A file created with this option will restore somewhat more slowly than one created without it, but the tradeoff is that the file is substantially easier to parse with your eyeballs, so you can see what the data looks like during this intermediate stage of the process.

Since my bet on MyISAM didn't pan out so well, I'm really unsure where to put my money. Either the dump file looks right and some kind of strangeness on the target server scrambles it on the way in, or the dump file itself comes out looking wrong. I really will be surprised if the dump file comes out looking wrong, but then again, if it looks right, I'd be pretty surprised if it doesn't restore correctly.

After reviewing the file, if the content looks right (i.e., find the line in the file that represents the 438/2255 row you mentioned above, and see which version it represents... and make sure there's only one such line in the dump file) load it onto the target server and see what the results look like. If the contents of the dump file don't look right, then of course, restoring it would be unnecessary and we need to look at the origin server for something strange.
