SQL Server 2008 R2: (OUTER APPLY) Why is SQL scanning all rows instead of the rows I specify?


I have a reasonably large ETL table (170M rows, 40GB).

The clustered index is on EVENT_DTTM.

Every day a new batch of data is inserted and processed. Part of the processing is updating each row (only the newly inserted ones) based on other rows in the table (e.g. compare the datetime of each row to the row before it and calculate "event time"; detect rows with malformed data and use data from nearby rows to correct them).

I'm using an OUTER APPLY (top 1) structure to do this. Basically, something of the form:

select *, event_duration = datediff(ss, alias.EVENT_DTTM, [TABLE].EVENT_DTTM)
from [TABLE]
outer apply
  (select top 1 EVENT_DTTM
   from [TABLE] alias
   where alias.EVENT_DTTM >= @min_date
   and alias.EVENT_DTTM < @max_date
   and alias.ETC = [TABLE].ETC
   order by alias.EVENT_DTTM desc) alias  -- top 1 is only deterministic with an order by
where [TABLE].EVENT_DTTM >= @min_date
and [TABLE].EVENT_DTTM < @max_date

My problem is that SQL appears to be scanning the entire table for the OUTER APPLY, instead of just the range between @min_date and @max_date.

(screenshot of the actual execution plan)

Here is an estimated query plan for the same query but with different parameters. The query does not run in parallel but the index spool is present.

Does anyone have an idea why this might be, and how to fix it?

I've tried building a CTE limited to the given date range, and then updating that. I've tried building a dynamic SQL string using literals for the dates. Neither of these has worked.

The only thing that I've found that works is to build a filtered index on the date range. SQL will then use the filtered index for the OUTER APPLY. This isn't very elegant, and building the index takes a lot of time.
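For reference, the filtered index was along these lines (the index name, key order, and INCLUDE list here are illustrative, and the date literals stand in for the current batch's range):

create nonclustered index ix_fceb_date_filtered
on fact_cabinet_event_base (facility_applix_number, idr, idri, idria, event_datetime)
include (device_id, event_num, event_type)
where event_datetime >= '20160101' and event_datetime < '20160110';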

Unfortunately, I don't have time to implement any kind of partitioning scheme at this time.

For the sake of completeness, here is an example of my actual code:

update fact_cabinet_event_base
set assoc_event_device_id = assoc_event.device_id,
    assoc_event_num = assoc_event.event_num
from fact_cabinet_event_base
outer apply
  (select top 1 device_id, event_num
   from fact_cabinet_event_base assoc_event
   where assoc_event.event_datetime >= dateadd(dd, -3, @populate_min_date)
   and assoc_event.event_datetime < @populate_max_date
   and assoc_event.event_type = 'dispense'
   and assoc_event.facility_applix_number = fact_cabinet_event_base.facility_applix_number
   and assoc_event.idr = fact_cabinet_event_base.idr
   and assoc_event.idri = fact_cabinet_event_base.idri
   and assoc_event.idria = fact_cabinet_event_base.idria
   and assoc_event.event_datetime < fact_cabinet_event_base.event_datetime
   order by assoc_event.event_datetime desc) assoc_event
where fact_cabinet_event_base.event_datetime >= @populate_min_date
and fact_cabinet_event_base.event_datetime < @populate_max_date
and (fact_cabinet_event_base.event_type = 'waste'
     or (fact_cabinet_event_base.event_type = 'return' and fact_cabinet_event_base.event_subtype <> 'return to pharmacy'))
and (fact_cabinet_event_base.assoc_event_num is NULL or fact_cabinet_event_base.assoc_event_num = 0)

Best Answer

When you see an index spool in a query plan, it usually means that SQL Server thinks the query badly needs a suitable index on the table, so much so that it builds a temporary index for you. This can often be resolved by creating a permanent index, so that plans don't have to build temporary indexes on the fly. As you might imagine, query plans with index spools often aren't optimal for performance.

Let's examine why SQL Server thinks the index spool is needed. In the query that you posted the execution plan for, SQL Server estimates that 351644 rows will go into the outer input of the nested loop join. If it used the clustered index on the table for the inner side of the join, it would have to repeat the date filtering 351644 times. Because you filter with local variables, the optimizer doesn't know their values at compile time and falls back on its default cardinality estimate for a range predicate: 9% of the rows in the table. So from SQL Server's point of view, using the clustered index could mean scanning 351644 * 176022000 * 0.09 = 5.5 trillion rows in the worst case.

For some queries it can help to add a RECOMPILE hint so that SQL Server knows the values of the local variables when compiling the plan. Since you're already using dynamic SQL, you could also consider replacing the variables with date literals. Both of those will improve the estimate, but I doubt either will change the plan. The query you posted covers a date range of 5 days. Even with better information, if SQL Server used the index it could still scan those same 5 days of data 351644 times in the worst case. In reality it would likely scan fewer rows than that, because it only needs to find the first row that matches all of the predicates, but that's a tough thing to estimate, especially with so many predicates.
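For illustration, here is a minimal sketch of the hint against the same table (the dates are made up; the hint simply goes at the end of the statement and works the same way on the UPDATE):

declare @populate_min_date datetime = '20160101';
declare @populate_max_date datetime = '20160106';

select count(*)
from fact_cabinet_event_base
where event_datetime >= @populate_min_date
and event_datetime < @populate_max_date
option (recompile); -- compiled with the actual variable values instead of the 9% guess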

So, SQL Server created an index spool in an attempt to speed up the query. There are a lot of things to dislike about the index spool: it will be built single-threaded even in a parallel plan, the query has to wait on the index to be built each time it runs, it scans the entire table, it takes up space in tempdb, and so on. If you want to avoid the index spool I recommend rewriting the query so that a nested loop join is not done, creating a more selective index against the Fact_Cabinet_Event_Base table, or creating a temp table to hold the rows from Fact_Cabinet_Event_Base that you need to query.

It should be possible to rewrite the query to not use APPLY. There isn't anything wrong with how you wrote the query, but in my experience APPLY tends to encourage the optimizer to use a nested loop join. If you want to keep the APPLY, it's important to have a good index to support the nested loop join. If you rewrite the query without APPLY, you may get a hash join instead, which could be faster. However, if it's important to avoid scanning the entire fact table, then the hash join probably isn't what you want.
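For example, here is a rough sketch of an APPLY-free version using ROW_NUMBER(). It assumes that event_num uniquely identifies a row in fact_cabinet_event_base; if it doesn't, partition by whatever combination of columns is unique per row:

with ranked as
  (select base.assoc_event_device_id,
          base.assoc_event_num,
          new_device_id = d.device_id,
          new_event_num = d.event_num,
          rn = row_number() over
                 (partition by base.event_num
                  order by d.event_datetime desc)
   from fact_cabinet_event_base base
   join fact_cabinet_event_base d
     on d.facility_applix_number = base.facility_applix_number
    and d.idr = base.idr
    and d.idri = base.idri
    and d.idria = base.idria
    and d.event_datetime < base.event_datetime
   where d.event_type = 'dispense'
   and d.event_datetime >= dateadd(dd, -3, @populate_min_date)
   and d.event_datetime < @populate_max_date
   and base.event_datetime >= @populate_min_date
   and base.event_datetime < @populate_max_date
   and (base.event_type = 'waste'
        or (base.event_type = 'return' and base.event_subtype <> 'return to pharmacy'))
   and (base.assoc_event_num is null or base.assoc_event_num = 0))
update ranked
set assoc_event_device_id = new_device_id,
    assoc_event_num = new_event_num
where rn = 1;

One semantic difference to be aware of: with the inner join, rows that have no matching dispense event are simply left untouched, whereas the OUTER APPLY version would overwrite them with NULL.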

If you want to create an index, it's hard to give advice on which columns to include and how to order the key columns. It will depend on the selectivity of each column and how quickly you need this query to run. You could create a covering index for this query, but I imagine it would be quite large.
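Purely as a starting point, an index shaped like this would let the inner TOP 1 turn into a short backward range seek (the name, key order, and INCLUDE list are my guesses, not a recommendation to take on faith):

create nonclustered index ix_fceb_assoc_lookup
on fact_cabinet_event_base
  (facility_applix_number, idr, idri, idria, event_type, event_datetime)
include (device_id, event_num);

With equality predicates on the first five key columns, the TOP 1 ... ORDER BY event_datetime DESC can be satisfied by reading a handful of rows from the end of the matching range.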

If you can't afford to create an index, another option is to change your code to use a temp table instead. Give the temp table the right clustered index and filter out the rows you don't need based on event_datetime as you populate it. This has a few advantages over letting SQL Server build an index spool for you: the query can run in parallel, you don't have to scan the entire table because you can filter on event_datetime, and you get to pick the order of the key columns in the temp table. However, your query will have to wait for the temp table to be built each time it runs.
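Here is a sketch of that approach, staging only the candidate 'dispense' rows (the temp table name and column choices are illustrative):

select facility_applix_number, idr, idri, idria,
       event_datetime, device_id, event_num
into #dispense_events
from fact_cabinet_event_base
where event_type = 'dispense'
and event_datetime >= dateadd(dd, -3, @populate_min_date)
and event_datetime < @populate_max_date;

create clustered index cx_dispense_events
on #dispense_events (facility_applix_number, idr, idri, idria, event_datetime);

The OUTER APPLY in the update would then target #dispense_events instead of the full fact table, so each seek from the nested loop lands in a small, purpose-ordered structure.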