SQL Server 2012 – How to Filter Table Before Join for Better Performance

performance, query-performance, sql-server, sql-server-2012

I have a stored procedure that populates the temp table #employee_benefits with a list of Ids. This table ends up being roughly 10,000 rows long. The query below then selects from a table called EmployeeBenefitData that has around 4 million rows.

SELECT  ebd.EmployeeBenefitDataId, ebd.EmployeeBenefitId, ebd.[DataDefinitionId]
FROM    #employee_benefits eb
INNER JOIN EmployeeBenefitData ebd ON eb.EmployeeBenefitId = ebd.EmployeeBenefitId

The bottleneck was the index scan on the EmployeeBenefitData table: the plan scanned that index first and only then joined the result to the temp table. Since the temp table was effectively acting as a filter, scanning all the data before the join was very inefficient. I added the following code to change the scan to a seek, which dramatically cut down the number of reads needed.

DECLARE @MinEmpBenId INT, @MaxEmpBenId INT

SELECT @MinEmpBenId = MIN(EmployeeBenefitId), @MaxEmpBenId = MAX(EmployeeBenefitId)
FROM #employee_benefits

SELECT  ebd.EmployeeBenefitDataId, ebd.EmployeeBenefitId, ebd.[DataDefinitionId],
        dd.TypeId, dd.DataDefinitionId, dd.Name, ebd.[Value], ebd.[Date], ebd.[Text]
FROM    #employee_benefits eb
INNER JOIN EmployeeBenefitData ebd ON eb.EmployeeBenefitId = ebd.EmployeeBenefitId
INNER JOIN DataDefinition dd ON ebd.DataDefinitionId = dd.DataDefinitionId
WHERE   ebd.EmployeeBenefitId >= @MinEmpBenId AND ebd.EmployeeBenefitId <= @MaxEmpBenId

It makes a massive difference in the client statistics:

Total execution time:        74 vs. 1794
Wait time on server replies: 11 vs. 11

My question is: Is this good practice? And why doesn't the optimiser do this?

UPDATE
I should have mentioned that the temp table has a clustered index on EmployeeBenefitId.
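For illustration, it is defined something like this (the index name and the population logic shown here are only placeholders; the real table is filled earlier in the stored procedure):

CREATE TABLE #employee_benefits
(
    EmployeeBenefitId INT NOT NULL
);

-- clustered index on the join/filter column
CREATE CLUSTERED INDEX IX_employee_benefits ON #employee_benefits (EmployeeBenefitId);

-- ...the stored procedure then populates it with roughly 10,000 EmployeeBenefitId values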

Best Answer

Is this good practice?

In this circumstance I would say yes. I'd probably also add an OPTION (RECOMPILE) to let the optimiser "sniff" the variable values. The optimal plan will likely vary depending on the proportion of rows in the larger table that fall within this range.
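Applied to the query from the question, that just means appending the hint; nothing else changes:

SELECT  ebd.EmployeeBenefitDataId, ebd.EmployeeBenefitId, ebd.[DataDefinitionId],
        dd.TypeId, dd.DataDefinitionId, dd.Name, ebd.[Value], ebd.[Date], ebd.[Text]
FROM    #employee_benefits eb
INNER JOIN EmployeeBenefitData ebd ON eb.EmployeeBenefitId = ebd.EmployeeBenefitId
INNER JOIN DataDefinition dd ON ebd.DataDefinitionId = dd.DataDefinitionId
WHERE   ebd.EmployeeBenefitId >= @MinEmpBenId AND ebd.EmployeeBenefitId <= @MaxEmpBenId
OPTION (RECOMPILE); -- compile with the actual @MinEmpBenId/@MaxEmpBenId values visible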

It provides the optimiser with a potentially useful extra access path, and it is not something the query optimiser ever does by itself as far as I know. The closest built-in behaviour is that a merge join will stop processing one input as soon as the other is exhausted, which can avoid a full scan.

The only downside that springs to mind is that calculating the min/max range values could itself be expensive, but this should be very cheap if the table you are using as a filter is indexed on that column.

I created two test tables

CREATE TABLE EmployeeBenefitData(EmployeeID INT PRIMARY KEY);

CREATE TABLE FilteredEmployee(EmployeeID INT PRIMARY KEY);

And loaded EmployeeBenefitData with integers from 1 to 4,000,000 (6,456 pages)

And FilteredEmployee with integers from 2,000,000 to 2,010,000 (19 pages)
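(The loading code isn't shown above; one way of generating equivalent data on SQL Server 2012 is a ROW_NUMBER-based insert along these lines, using cross joins of sys.all_objects purely as a convenient row source.)

-- Sketch only: generate the integers 1 to 4,000,000
INSERT INTO EmployeeBenefitData (EmployeeID)
SELECT TOP (4000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM   sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
CROSS JOIN sys.all_objects AS c;

-- Sketch only: the 10,001 integers from 2,000,000 to 2,010,000
INSERT INTO FilteredEmployee (EmployeeID)
SELECT EmployeeID
FROM   EmployeeBenefitData
WHERE  EmployeeID BETWEEN 2000000 AND 2010000;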

And then ran 6 queries of the following form

DECLARE @E1 INT,
        @E2 INT

SELECT @E1 = FE.EmployeeID,
       @E2 = EBD.EmployeeID
FROM   FilteredEmployee FE
       INNER LOOP JOIN EmployeeBenefitData EBD
         ON FE.EmployeeID = EBD.EmployeeID
OPTION (MAXDOP 1);

The six permutations were produced by swapping the order of the two tables and trying all three join hints: LOOP, MERGE, and HASH.
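So, for example, the merge join permutation with the larger table on the left takes this form (only the table order and the join hint change):

DECLARE @E1 INT,
        @E2 INT

SELECT @E1 = FE.EmployeeID,
       @E2 = EBD.EmployeeID
FROM   EmployeeBenefitData EBD
       INNER MERGE JOIN FilteredEmployee FE
         ON FE.EmployeeID = EBD.EmployeeID
OPTION (MAXDOP 1);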

Results were as follows

+------------+-------------+-------+----------------+-----------+---------------+----------+
| Left Table | Right Table | Join  | EBD Scan Count | EBD reads | FE Scan Count | FE reads |
+------------+-------------+-------+----------------+-----------+---------------+----------+
| FE         | EBD         | Loop  |              0 |     30637 |             1 |       19 |
| EBD        | FE          | Loop  |              1 |      6456 |             0 |  8250009 |
| FE         | EBD         | Merge |              1 |      3257 |             1 |       19 |
| EBD        | FE          | Merge |              1 |      3257 |             1 |       19 |
| FE         | EBD         | Hash  |              1 |      6456 |             1 |       19 |
| EBD        | FE          | Hash  |              1 |      6456 |             1 |       19 |
+------------+-------------+-------+----------------+-----------+---------------+----------+

The figures above illustrate the point about the merge join, as it "only" scanned just over half of the larger table. It still read all the rows from 1 to 1,999,999 first, though, and discarded them.

Repeating the experiment with the predicate WHERE EBD.EmployeeID BETWEEN 2000000 AND 2010000 added to each query gave the results below.
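For example, the first (loop join) permutation becomes:

DECLARE @E1 INT,
        @E2 INT

SELECT @E1 = FE.EmployeeID,
       @E2 = EBD.EmployeeID
FROM   FilteredEmployee FE
       INNER LOOP JOIN EmployeeBenefitData EBD
         ON FE.EmployeeID = EBD.EmployeeID
WHERE  EBD.EmployeeID BETWEEN 2000000 AND 2010000
OPTION (MAXDOP 1);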

+------------+-------------+-------+----------------+-----------+---------------+----------+
| Left Table | Right Table | Join  | EBD Scan Count | EBD reads | FE Scan Count | FE reads |
+------------+-------------+-------+----------------+-----------+---------------+----------+
| FE         | EBD         | Loop  |              0 |     30637 |             1 |       19 |
| EBD        | FE          | Loop  |              1 |        21 |             0 |    20636 |
| FE         | EBD         | Merge |              1 |        21 |             1 |       19 |
| EBD        | FE          | Merge |              1 |        21 |             1 |       19 |
| FE         | EBD         | Hash  |              1 |        21 |             1 |       19 |
| EBD        | FE          | Hash  |              1 |        21 |             1 |       19 |
+------------+-------------+-------+----------------+-----------+---------------+----------+

The only query that didn't benefit from the additional range predicate was the one with the larger table on the inside of a nested loops join.

This is of course not surprising as that plan (plan 1 below) is driven by repeated index seeks using the values from FilteredEmployee.

[Execution plan images: plan 1 and plan 2]

Plan 1 was also the one chosen "naturally" by the optimiser without the range predicate. With the range predicate in place it chose a different merge join plan, seeking into the relevant index range without scanning unnecessary rows, and costed that as significantly cheaper (plan 2).