Table structure:
    Foo       FooChild   Bar
    ---       --------   ---
    ID        ID         ID
    Date      FooID      Date
    GroupID   UserID     UserID
    Notes     Amount     GroupID
                         IsComplete
- `Foo` has a unique index on `Date`+`GroupID`
- `FooChild` has a FK to `Foo`, and a unique index on `FooID`+`UserID`, which includes `Amount`
- `Bar` has a unique index on `Date`+`UserID`+`GroupID`, which includes `IsComplete`
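For reference, the structure above could be declared roughly like this (the data types, nullability, and index names here are my assumptions, not part of the question):

```sql
create table Foo (ID int primary key, Date date not null, GroupID int not null, Notes nvarchar(max));
create unique index UQ_Foo_Date_GroupID on Foo (Date, GroupID);

create table FooChild (ID int primary key, FooID int not null references Foo (ID),
    UserID int not null, Amount money not null);
create unique index UQ_FooChild_FooID_UserID on FooChild (FooID, UserID) include (Amount);

create table Bar (ID int primary key, Date date not null, UserID int not null,
    GroupID int not null, IsComplete bit not null);
create unique index UQ_Bar_Date_UserID_GroupID on Bar (Date, UserID, GroupID) include (IsComplete);
```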
Now I need to create a report showing the sum of all FooChild amounts alongside the count of complete Bars for any given date range. The users also want to be able to see the stats per group or per user. This would seem to be a great place to write a view:
create view vFooBar as
select f.Date, f.GroupID, fc.UserID, fc.Amount, b.IsComplete
from Foo f
join FooChild fc on fc.FooID = f.ID
left join Bar b on b.Date = f.Date and b.GroupID = f.GroupID and b.UserID = fc.UserID
union
select b.Date, b.GroupID, b.UserID, x.Amount, b.IsComplete
from Bar b
left join
(
    select f.Date, f.GroupID, fc.UserID, fc.Amount
    from Foo f
    join FooChild fc on fc.FooID = f.ID
) x on x.Date = b.Date and x.GroupID = b.GroupID and x.UserID = b.UserID
(Here's why I wrote the view this way.)
Now I can easily write queries like this:
select UserID, sum(Amount) FooAmount, sum(cast(IsComplete as int)) CompletedBars
from vFooBar
where Date between @fromDate and @toDate
group by UserID
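The per-group report the users asked for is the same shape, just grouped on a different column:

```sql
select GroupID, sum(Amount) FooAmount, sum(cast(IsComplete as int)) CompletedBars
from vFooBar
where Date between @fromDate and @toDate
group by GroupID
```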
But there's a snag here. As soon as the date range starts getting relatively big, the execution plan goes all pear-shaped. It uses the date index on `Foo`, but instead of then using the `FooID` index on `FooChild`, it does a clustered index scan, then a hash match on `FooID` to join with the results from `Foo`. And it does that twice in the overall plan; I'm guessing once for each aggregate. And that really hurts.
I understand that using the index I created on `FooChild` may not be efficient, since the values of `FooID` for a given date might be non-contiguous, though typically they are inserted in roughly the same order.
I could denormalize, and add `Date` and `GroupID` to the `FooChild` table, then index those columns, and I'm pretty sure that would improve performance a lot. But it just doesn't feel right.
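For concreteness, the denormalization I have in mind would look roughly like this (index name and included columns are just one possible shape):

```sql
-- Sketch only: copy the parent's Date and GroupID down to FooChild,
-- backfill existing rows from Foo, then index the new columns.
alter table FooChild add Date date, GroupID int;

create index IX_FooChild_Date_GroupID
    on FooChild (Date, GroupID)
    include (UserID, Amount);
```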
Any other ideas?
Best Answer
The optimizer makes choices based on costing estimates. The cost model is generic, and may not always choose optimal plans for your particular hardware, and its assumptions may not always be valid for your circumstances.
In this case, the optimizer assesses a hash join as the cheaper option over nested loops when the estimated number of rows to be joined is large. If you are certain that a nested loops join will always be preferable to a hash join, you could consider (and test!) forcing a seek instead of a scan of the `FooChild` table in the view.

Side note: while this transformation from the original full join is valid given the current uniqueness constraints on your tables, please review the answer to your previous question and consider rewriting the full join as suggested in my edit.
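For illustration, here is one way that might look, applying SQL Server's `FORCESEEK` table hint to `FooChild` in both branches of the view (the hint placement is a sketch on my part; verify the resulting plan and performance before relying on it):

```sql
create view vFooBar as
select f.Date, f.GroupID, fc.UserID, fc.Amount, b.IsComplete
from Foo f
join FooChild fc with (forceseek) on fc.FooID = f.ID
left join Bar b on b.Date = f.Date and b.GroupID = f.GroupID and b.UserID = fc.UserID
union
select b.Date, b.GroupID, b.UserID, x.Amount, b.IsComplete
from Bar b
left join
(
    select f.Date, f.GroupID, fc.UserID, fc.Amount
    from Foo f
    join FooChild fc with (forceseek) on fc.FooID = f.ID
) x on x.Date = b.Date and x.GroupID = b.GroupID and x.UserID = b.UserID;
```

Bear in mind that baking a hint into a view applies it to every query that uses the view, so this trades plan flexibility for predictability.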