I'm curious about if MySQL's where ... in (...)
statement removes duplicates as an optimization.
For example, if I use a subquery, is it important that I use DISTINCT
to remove duplicates?
An example query:
SELECT * FROM foo WHERE bar_id IN (
SELECT id FROM bar where user_id = 4
);
The subquery has the potential to return duplicate bar.id
's.
Does MySQL query optimization make this query better?
SELECT * FROM foo WHERE bar_id IN (
SELECT DISTINCT id FROM bar where user_id = 4
);
From my tests on a database with about 2.5mm rows in the bar
table (with indexes where appropriate for such a query), the result time is roughly the same (on a large RDS instance with plenty of overhead for now).
I should note that I'm hoping for an explanation more than "the subquery returns less results when using
DISTINCT
, so of course it's better", as that ignores MySQL's query optimizer.For example, perhaps
DISTINCT
uses more resources and is therefore slower overall, especially if awhere...in(...)
statement easily optimizes duplicates. These are the details I'm unsure about.
Best Answer
Assuming that
user_id
is not aPRIMARY KEY
, then MySQL will need to de-duplicate the subquery to apply the correct semantics. TheDISTINCT
keyword should not be required here since it doesn't change the semantics.MySQL actually has more than one strategy for how to remove duplicates: https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization.html
To see the strategy used, you need to paste the output from
EXPLAIN FORMAT=JSON
(it does not appear in the regular tabular EXPLAIN). You will see something like: