Improve performance with the WHERE NOT IN sub-select clause

execution-planfirebirdperformancequery-performancesubquery

In the following query, I have to count transactions for each customer. [EDIT]However, I have to exclude from the result set entirely, customers that have a transaction older than one year.

Shouldn't the query optimizer be smart enough to only evaluate existence once for each customer?

--Count transactions on customers that are less than 1 year old

  SELECT t1.CUSTID,COUNT(*)
  FROM CUST_TRX t1
  WHERE NOT EXISTS ( 
    SELECT FIRST 1 1 
    FROM CUST_TRX t2 
    WHERE 
      t2.CUSTID=t1.CUSTID AND
      t2.DATED<CURRENT_DATE-365
    GROUP BY t2.CUSTID
  )
  GROUP BY t1.CUSTID

There are no naturals in my query plan. This query is performing as if the database is running the existence clause for every transaction instead of running it for every customer. Performance is the same if I remove the GROUP BY in the sub-query.

Is there a better way to do this so that I may get better performance out of the database? Hopefully a simple select query will work avoiding a CTE if possible (that would introduce other challenges). Due to other group by criteria (not shown here) I'm not able to simply check MIN(DATED), I really need to perform another query.

Best Answer

With queries like this is it often more efficient to perform a LEFT OUTER JOIN instead of the NOT EXISTS style check, it often implies a full index scan (or table scan without the right indexes in place) but with many rows in the main table(s) this is less expensive than the large number of index seeks (one on the reference table for each row returned from the main table) that would otherwise result. Some query planners are quite bright about spotting this equivalence and using the alternate plan where it is the better choice, but it doesn't sound like this has happened in your case.

Try something like:

SELECT t1.CUSTID, COUNT(*)
FROM   CUST_TRX t1
LEFT OUTER JOIN
       CUST_TRX t2 
ON     t2.CUSTID=t1.CUSTID 
AND    t2.DATED<CURRENT_DATE-365
WHERE  t2.CUSTID IS NULL
GROUP BY t1.CUSTID

(note: I'm not familiar with firebird, so the above syntax may need tweaks but should illustrate the point)

Without the WHERE t2.CUSTID IS NULL every row from t1 with matches in t2 will be output once for every match found in t2 and those with no matches in t2 will be output once but with any columns selected from that object set to NULL. The WHERE clause then screens out the matches.

Depending on the DB engine's abilities, especially if the amount of data in the reference object (CUST_TRX with a filter applied here) is huge, this may be significantly less efficient than the WHERE <something> NOT IN or WHERE NOT EXISTS options, so benchmark over realistic data sets first before using the method. It often works out much more efficient with MS SQL Server in cases where the query planner doesn't notice that the WHERE NOT IN arrangement can be performed this way more efficiently.

Also if you do it this way around leave a comment in the code (and/or supporting documentation) to say that you are doing this as an equivalent to WHERE <something> NOT IN or WHERE NOT EXISTS which you expect to be more efficient. You'll remember it and an experienced SQL person will recognise the pattern, but other people looking at the code might not immediately understand the intent/reason and flip it back to using WHERE NOT EXISTS for clarity as that reads better as on English sentence.