Sql-server – Nested loops estimate way too low, causes tempdb spills

optimizationperformancequery-performancesql serversql-server-2016

I'm working on tuning a query, and I think the main problem right now is that there is a Nested Loops step early on with a row estimate that is way too low, causing too little memory to be allocated and downstream steps to spill to tempdb. The troublesome step is the one in the green box below.

Plan

https://www.brentozar.com/pastetheplan/?id=Sk8E-6YAM

The two inputs into the Nested Loops both have accurate row estimates of 2.4M rows, but the output of that step is estimated at only 37 rows, while the actual is 2.4M.

All three Table Spool steps below have the exact same estimated vs actual as well, which makes me think they're getting their estimate from the Nested Loops. Every branch spills to tempdb in a Sort step.

I'm thinking if I can just get the estimated rows out of the Nested Loops corrected, it will not only prevent spills on the top branch, but that the Table Spools will also inherit the correct estimate, get a sufficient memory grant, and also not spill.

SQL Server 2016 SP2 with the legacy CE on.

Here's the query

SELECT  Object18.Column1,
        Object18.Column3,
        Function1(Object19.Column12) AS Column13,
        Function2(Object19.Column12) AS Column14,
        Function3(DISTINCT (CASE WHEN Object19.Column15 = ? THEN Object18.Column6 END)) AS Column16,
        Function3(DISTINCT (CASE WHEN Object19.Column17 = ? THEN Object18.Column6 END)) AS Column18,
        Function3(DISTINCT Object18.Column6) AS Column19
from Object8 Object18
join Object20 Object19 on Object19.Column20 = Object18.Column7 and Object19.Column3 = Object18.Column3
where Object19.Column8 = ?
GROUP BY Object18.Column1, Object18.Column3
option (recompile)

Best Answer

I was able to get the estimates corrected by creating an index on the temp table after inserting. This solved the tempdb spills, but performance didn't improve much. Ultimately, a combination of separating the data query from the aggregation, and shuffling some other queries around to get maximum benefit of parallelization, combined to reduce the speed by 4x. Thanks everyone, especially Joe Obbish