I have a stored procedure that populates the temp table #employee_benefits with a list of Ids. This table ends up being roughly 10,000 rows long. The query below then selects from a table called EmployeeBenefitData that has around 4 million rows.
SELECT ebd.EmployeeBenefitDataId, ebd.EmployeeBenefitId, ebd.[DataDefinitionId]
FROM #employee_benefits eb
INNER JOIN EmployeeBenefitData ebd ON eb.EmployeeBenefitId = ebd.EmployeeBenefitId
The bottle neck was the index scan on the EmployeeBenefitData table. It did the index scan first and then joined it to the temp table. The temp table was acting as a filter meaning that scanning all the data before the join was very inefficient. I added the following code to change the scan to a seek and dramatically cut down the amount of reads needed.
DECLARE @MinEmpBenId INT, @MaxEmpBenId INT
SELECT @MinEmpBenId = MIN(EmployeeBenefitId), @MaxEmpBenId = MAX(EmployeeBenefitId)
FROM #employee_benefits
SELECT ebd.EmployeeBenefitDataId, ebd.EmployeeBenefitId, ebd.[DataDefinitionId],
dd.TypeId, dd.DataDefinitionId, dd.Name, ebd.[Value], ebd.[Date], ebd.[Text]
FROM #employee_benefits eb
INNER JOIN EmployeeBenefitData ebd ON eb.EmployeeBenefitId = ebd.EmployeeBenefitId
INNER JOIN DataDefinition dd ON ebd.DataDefinitionId = dd.DataDefinitionId
WHERE ebd.EmployeeBenefitId >= @MinEmpBenId AND ebd.EmployeeBenefitId <= @MaxEmpBenId
It makes a massive difference in the client stats
Total execution time 74, 1794
Wait time on server replies 11, 11
My question is: Is this good practice? And why doesn't the optimiser do this?
UPDATE
I should have mentioned that the temp table has a clustered index on EmployeeBenefitID
Best Answer
In this circumstance I would say yes. I'd probably also add an
OPTION (RECOMPILE)
to let it "sniff" the variable values. The optimal plan will likely vary dependant on the proportion of rows in the larger table that match this range.It provides a potentially useful extra path to the optimiser and it is not something that the query optimiser ever does by itself as far as I know. The closest thing to it is that with a merge join it will stop processing an input when either one is finished. Thus meaning that it potentially avoids a full scan.
The only downside that springs to mind would be if the calculation of the min/max range values itself might be expensive (but this should be very cheap if the table you are using as a filter is indexed on that column).
I created two test tables
And loaded EmployeeBenefitData with integers from 1 to 4,000,000 (6,456 pages)
And FilteredEmployee with integers from 2,000,000 AND 2,010,000 (19 pages)
And then ran 6 queries of the following form
The 6 permutations were made up by inverting the order of the two tables and trying all three join types
LOOP
,MERGE
,HASH
.Results were as follows
The above figures illustrate the point about merge join as it "only" scans just over half the larger table. It still read all the rows from 1 to 1,999,999 first though and discarded them.
Repeating the experiment with a
WHERE EBD.EmployeeID BETWEEN 2000000 AND 2010000
gave the following.The only query that didn't benefit from the additional range predicate is the one where the larger table was on the inside of a nested loops join.
This is of course not surprising as that plan (plan 1 below) is driven by repeated index seeks using the values from
FilteredEmployee
.Plan 1 was also the one chosen "naturally" by the optimiser without the range predicate. With the range predicate in place it chose a different merge join plan seeking into the relevant index range without scanning unnecessary rows and costed that as significantly cheaper (plan 2)