SQL Server 2012 – Index Optimization on a View Using Full Outer Join

indexnormalizationsql-server-2012view

Table structure:

Foo             FooChild                Bar
---             --------                ---
ID              ID                      ID
Date            FooID                   Date
GroupID         UserID                  UserID
Notes           Amount                  GroupID
                                        IsComplete
  • Foo has a unique index on Date+GroupID
  • FooChild has a FK to Foo, and a unique index on FooID+UserID, which includes Amount
  • Bar has a unique index on Date+UserID+GroupID, which includes IsComplete

Now I need to create a report showing the sum of all FooChild amounts alongside the count of complete Bars for any given date range. The users also want to be able to see the stats per group or per user. This would seem to be a great place to write a view:

create view vFooBar as
select f.Date, f.GroupID, fc.UserID, fc.Amount, b.IsComplete
from Foo f join FooChild fc on fc.FooID = f.ID
left join Bar b on f.Date = b.Date and f.GroupID = b.GroupID and fc.UserID = b.UserID
union
select b.Date, b.GroupID, b.UserID, x.Amount, b.IsComplete
from Bar b left join 
    (select f.Date, f.GroupID, fc.UserID, fc.Amount
    from Foo f join FooChild fc on fc.FooID = f.ID) x
on x.Date = b.Date and x.GroupID = b.GroupID and x.UserID = b.UserID

(Here's why I wrote the view this way.)

Now I can easily write queries like this:

select UserID, sum(Amount) FooAmount, sum(cast(IsCompleted as int)) CompletedBars
from vFooBar
where Date between @fromDate and @toDate
group by UserID

But there's a snag here. As soon as the date range starts getting relatively big, the execution plan goes all pear shaped. It uses the date index on Foo, but instead of then using the FooID index on FooChild, it does a clustered index scan, then a hash match on FooID to join with the results from Foo. And it does that twice in the overall plan; I'm guessing once for each aggregate. And that really hurts.

I understand that using the index I created on FooChild may not be efficient, since the values of FooID for a given date might be discrete, though typically they are inserted in roughly the same order.

I could denormalize, and add Date and GroupID to the FooChild table, then index those columns, and I'm pretty sure that would improve performance a lot. But it just doesn't feel right.

Any other ideas?

Best Answer

The optimizer makes choices based on costing estimates. The cost model is generic, and may not always choose optimal plans for your particular hardware, and its assumptions may not always be valid for your circumstances.

In this case, the optimizer assesses a hash join as the cheaper option over nested loops when the estimated number of rows to be joined is large. If you are certain that a nested loops join will always be preferable to a hash join, you could consider (and test!) forcing a seek instead of a scan of the FooChild table in the view:

SELECT 
    f.TheDate, 
    f.GroupID, 
    fc.UserID, 
    fc.Amount, 
    b.IsComplete
FROM dbo.Foo AS f 
JOIN dbo.FooChild AS fc WITH (FORCESEEK) -- New hint
    ON fc.FooID = f.ID
LEFT JOIN dbo.Bar AS b 
    ON f.TheDate = b.TheDate 
    AND f.GroupID = b.GroupID 
    AND fc.UserID = b.UserID
UNION
SELECT 
    b.TheDate, 
    b.GroupID, 
    b.UserID, 
    x.Amount, 
    b.IsComplete
FROM dbo.Bar AS b 
LEFT JOIN 
(
    SELECT 
        f.TheDate, 
        f.GroupID, 
        fc.UserID, 
        fc.Amount
    FROM dbo.Foo AS f 
    JOIN dbo.FooChild AS fc WITH (FORCESEEK) -- New hint
        ON fc.FooID = f.ID
) AS x
    ON x.TheDate = b.TheDate 
    AND x.GroupID = b.GroupID 
    AND x.UserID = b.UserID;

Side note: while this transformation from the original full join is valid given the current uniqueness constraints on your tables, please review the answer to your previous question and consider rewriting the full join as suggested in my edit.