Mysql – Filtering records not present with specific conditions in another table

MySQLperformancequery-performance

I have a table (let's call it extra_info) in MySQL that has various extra information/labels on user accounts:

account_id / data_key 
11 / 'test' 
11 / 'blah' 
12 / 'coupon a' 
12 / 'coupon b' 
15 / 'whatever' 
18 / 'test'

And so on.

I am building reports on other tables but need to filter out records that are present in the extra_info table as 'test'.

So in the example above, records 1-10, 13-14, 16-17 are fine because they are not in extra_info, 11 needs to be filtered out (it is 'test'), 12 is fine (but obviously just need to have it once even though is has 2 labels, 15 is fine (not 'test'), and 18 needs to be removed.

I have done this with subquery in the WHERE clause:

WHERE 
    -- filter out test labels 
    a.id NOT IN (SELECT ei.account_id 
                 FROM extra_info AS ei 
                 WHERE ei.data_key = 'test')

This does its job, however, I have a suspicion that a regular join would have worked faster. What's the right syntax for it?

If I just join and filter out the ones with data_key='test', #11 would still be there – so this is not a solution. Grouping by account_id will probably has performance implications and basically same impact as the current code, right?

Any ideas?

Best Answer

One option would look something like this:

SELECT a.* FROM account a
  LEFT JOIN extra_info ei ON ei.account_id = a.id AND ei.data_key = 'test'
 WHERE ei.account_id IS NULL;

This might seem counter-intuitive, since "obviously" er.account_id is never actually NULL sitting at rest in the table, so I'll explain:

The LEFT JOIN of course means all rows from account and only the matching rows from extra_data... and by "matching rows" we mean those having an identical account_id and a value of 'test' in data_key.

Every row in a result set has, at some point during execution, either a value (or a NULL) for every column from all tables, even if those columns aren't listed in the SELECT list... so, in a LEFT JOIN where the table on the right has no corresponding rows, the columns in the result set that "came from" the right-hand table are NULL... so, WHERE ei.account_id IS NULL filters out the rows with a value for ei.account_id (the ones with matching account_id and 'test' in data_key) leaving you with a result set containing the rows that you want.

If it's not apparent now, it makes more sense when you remember that the WHERE clause is not telling the server which rows to find in this case... it's specifying which rows not to eliminate.

Now, if you need other data from extra_info in your report, you will need to [left] join that same table again in your query with different aliases in addition to ei in order to pluck out the other values.

@ypercube's advice on indexes still stands... in the case of the query in the question, an index on (data_key,account_id) would probably be more useful, since the subquery isn't correlated to anything in the outer query, while in the LEFT JOIN query, this index would be what you'd want:

[UNIQUE] KEY(account_id,data_key)