Mysql – Speeding up a NOT IN subquery

innodbMySQLperformance

Good evening,

I'm currently up to try to improve the performance of some of my queries.
As far as I know, statements like "IN" or "NOT IN" are even faster with a large amount of values if a subquery on an indexed field without conditions is used.

SELECT * FROM table1 WHERE field1 NOT IN (SELECT index_field FROM table2)

When it comes to use conditions at table2, the query becomes realy slow on a large amount of data.

SELECT *
FROM table1
WHERE
    field1 NOT IN (
        SELECT
            index_field
        FROM table2
        WHERE
            user_id = '2'
    )

I was thinking about using LEFT Join but when I need to to the filtering for the user_id, it also becomes slow.

How may I solve this problem? Currently, I've no clue about that.

The original query looked sth. like this

SELECT
    i.*
FROM stream_item si
LEFT JOIN user__publisher_item ui
    ON ui.user_id = 2
    && ui.item_id = si.item_id
INNER JOIN item i
    ON i.id = si.item_id
WHERE 
    si.stream_id IN (5,7,8,9,19,24,29,42,43,44)
    && ui.id IS NULL || (
        ui.`read` = 0 || ui.saved = 1
    )
GROUP BY
    si.item_id
ORDER BY
    si.`found` DESC
LIMIT 0,10

id      select_type     table   type        possible_keys                                                                           key                 key_len     ref                 rows        Extra
1       SIMPLE          si      index       PRIMARY,stream_2_item,stream_id_found                                                   stream_2_item       4           \N                  663236      Using temporary; Using filesort
1       SIMPLE          ui      eq_ref      user_id_item_id,user_2_item,user_id_read_saved_hidden,user_id_saved,user_id_hidden      user_id_item_id     8           const,si.item_id    1           Using where
1       SIMPLE          i       eq_ref      PRIMARY                                                                                 PRIMARY             4           si.item_id          1   

I have a table which represents the n:m relationship between stream and items.
An unique item can be accessible via multiple streams.
Based upon this i have a state-table for user dependent states of an item.
Currently, the user-state-entry will only be generated, if the item was read or saved the first tim. Otherwise, there is no entry in the user_item table.
In the beginning, this query was fine but now, it takes a lot of time. Even if I leave out the "INNER JOIN" it takes 7 seconds on the current tables with around 500k rows each.

The table-structure in the background is as follows:

  • stream
  • stream_item //n:m relation between stream and items
  • item //unique items
  • stream_group //user-defined groups of multiple streams
  • stream_group_streams //_n:m relation between stream_groups and streams_
  • user
  • user_item //state-table that holds the user-dependent states of an item – if there is no state, no row exists for an item

I already asked this question at Stackoverflow but added my 2 Edits with further information to late so that I haven't got an answer anymore.

Best Answer

In general you can replace a NOT IN (...) subquery with a NOT EXISTS (...) and the latter will be faster. Logically they are the same though some database optimizers handle the NOT EXISTS better as they can stop the subquery check after the first match.

In contrast, if the DB optimizer isn't smart enough to figure out what your doing and the subquery has a lot of rows, then it must buffer all the rows for each subquery before performing each NOT IN check.

Try out something like this:

SELECT *
FROM table1 a
WHERE NOT EXISTS(SELECT 1
                   FROM table2 b
                  WHERE b.user_id = 2
                    AND b.index_field = a.field1)

In this example the optimizer need only check if a single row exists for the matching subquery, including the additional filter on b.user_id = 2. It should use an index on the index_field and user_id if they are available.