MySQL `where…in` statement – does it remove duplicates

MySQLmysql-5.7

I'm curious about if MySQL's where ... in (...) statement removes duplicates as an optimization.

For example, if I use a subquery, is it important that I use DISTINCT to remove duplicates?

An example query:

SELECT * FROM foo WHERE bar_id IN (
    SELECT id FROM bar where user_id = 4
);

The subquery has the potential to return duplicate bar.id's.

Does MySQL query optimization make this query better?

SELECT * FROM foo WHERE bar_id IN (
    SELECT DISTINCT id FROM bar where user_id = 4
);

From my tests on a database with about 2.5mm rows in the bar table (with indexes where appropriate for such a query), the result time is roughly the same (on a large RDS instance with plenty of overhead for now).

I should note that I'm hoping for an explanation more than "the subquery returns less results when using DISTINCT, so of course it's better", as that ignores MySQL's query optimizer.

For example, perhaps DISTINCT uses more resources and is therefore slower overall, especially if a where...in(...) statement easily optimizes duplicates. These are the details I'm unsure about.

Best Answer

Assuming that user_id is not a PRIMARY KEY, then MySQL will need to de-duplicate the subquery to apply the correct semantics. The DISTINCT keyword should not be required here since it doesn't change the semantics.

MySQL actually has more than one strategy for how to remove duplicates: https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization.html

To see the strategy used, you need to paste the output from EXPLAIN FORMAT=JSON (it does not appear in the regular tabular EXPLAIN). You will see something like:

     "transformation": {
                   "select#": 2,
                   "from": "IN (SELECT)",
                   "to": "semijoin",
                   "chosen": true
                 }