Mysql – Filtering records not present with specific conditions in another table

MySQLperformancequery-performance

I have a table (let's call it extra_info) in MySQL that has various extra information/labels on user accounts:

account_id / data_key 
11 / 'test' 
11 / 'blah' 
12 / 'coupon a' 
12 / 'coupon b' 
15 / 'whatever' 
18 / 'test'

And so on.

I am building reports on other tables but need to filter out records that are present in the extra_info table as 'test'.

So in the example above, records 1-10, 13-14, 16-17 are fine because they are not in extra_info, 11 needs to be filtered out (it is 'test'), 12 is fine (but obviously just need to have it once even though is has 2 labels, 15 is fine (not 'test'), and 18 needs to be removed.

I have done this with subquery in the WHERE clause:

WHERE 
    -- filter out test labels 
    a.id NOT IN (SELECT ei.account_id 
                 FROM extra_info AS ei 
                 WHERE ei.data_key = 'test')

This does its job, however, I have a suspicion that a regular join would have worked faster. What's the right syntax for it?

If I just join and filter out the ones with data_key='test', #11 would still be there – so this is not a solution. Grouping by account_id will probably has performance implications and basically same impact as the current code, right?

Any ideas?

Best Answer

One option would look something like this:

SELECT a.* FROM account a
  LEFT JOIN extra_info ei ON ei.account_id = a.id AND ei.data_key = 'test'
 WHERE ei.account_id IS NULL;

This might seem counter-intuitive, since "obviously" er.account_id is never actually NULL sitting at rest in the table, so I'll explain:

The LEFT JOIN of course means all rows from account and only the matching rows from extra_data... and by "matching rows" we mean those having an identical account_id and a value of 'test' in data_key.

Every row in a result set has, at some point during execution, either a value (or a NULL) for every column from all tables, even if those columns aren't listed in the SELECT list... so, in a LEFT JOIN where the table on the right has no corresponding rows, the columns in the result set that "came from" the right-hand table are NULL... so, WHERE ei.account_id IS NULL filters out the rows with a value for ei.account_id (the ones with matching account_id and 'test' in data_key) leaving you with a result set containing the rows that you want.

If it's not apparent now, it makes more sense when you remember that the WHERE clause is not telling the server which rows to find in this case... it's specifying which rows not to eliminate.

Now, if you need other data from extra_info in your report, you will need to [left] join that same table again in your query with different aliases in addition to ei in order to pluck out the other values.

@ypercube's advice on indexes still stands... in the case of the query in the question, an index on (data_key,account_id) would probably be more useful, since the subquery isn't correlated to anything in the outer query, while in the LEFT JOIN query, this index would be what you'd want:

[UNIQUE] KEY(account_id,data_key)

Related Solutions

List partitioning or filtering a non-partitioned table, which incurs lower cost

The performance benefits of partition pruning depend on the proportion of rows being selected from the table and the efficiency with which the rows can be isolated to their own partition(s).

If you are selecting 5% of a table's rows and those rows are isolated into a subset of partitions then selecting them is generally going to be around twenty times faster than a full table scan. On the other hand, selecting them via an index could be more or less efficient than a full table scan, but is not going to approach twenty times as fast except maybe in very particular circumstances (perfect clustering of values, bitmap index access, SSD storage ...).

There's virtually no overhead to inserting into a partitioned table. One advantage of partitioning when it comes to inserts is that it can allow multiple simultaneous direct path inserts to the table, which is impossible on a non-partitioned table as the insert would require a segment-level exclusive lock.

MySQL – Select Records in One Table but Not in Another Without Matching Primary Key

MySQL, in fact, all SQL products really barf at

c.id=NULL

It causes confusion. You really mean

c.id IS NULL

You need to change it to

SELECT count(*) FROM list l LEFT OUTER JOIN cardinal c ON l.sku=c.sku where c.id is null;

Give it a Try !!!

ABOUT YOUR QUESTION

Your first query

SELECT count(*) FROM list l LEFT JOIN cardinal c ON c.id=null;

resembles a natural join where nothing matches on the cardinal table, so the count 2677513 is correct.

Your second query

SELECT count(*) FROM list l LEFT OUTER JOIN cardinal c ON l.sku=c.sku where c.id=null;

is little more explicit, but the c.id=null makes it fail, thus getting 0 as a count.

Best Answer

Related Solutions

List partitioning or filtering a non-partitioned table, which incurs lower cost

MySQL – Select Records in One Table but Not in Another Without Matching Primary Key

ABOUT YOUR QUESTION

Related Question