SQL Server 2012 optimizer does not get it right.
Test case, summary:
This is a simplified test scenario. DDL statements at the bottom.
I have two tables for data logging, A
and B
. There is a 1:n relationship – A
has header records with a datetime called a_time
and B
has detail records, with a field B.akey
are referencing A.id
, and fields name
and (data)
.
A has approx. 25,000,000 records, B
has aprox. 500,000,000 records. B
has roughly 200 records referencing
each record in A
. One A
and approx. 200 B
records are inserted at a time together every
five minutes, reflected by A.a_time
.
Clustered indexes are the primary keys, id
, type int identity.
B
has one non-clustered index, named IX_B_akey
, on B.akey
.
A.a_time
is (non-clustered) indexed, too.
Now this query:
SELECT A.a_time, B.*
FROM B
join A on B.akey = A.id
where
A.a_time > '2017-01-13T01:30:00' and A.a_time < '2017-01-14T07:30:00'
and B.name in ('name33', 'name55', 'name66')
takes about 3 minutes on my database server. Execution plan: here (see below for more exact execution plans)
When I add a simple hint to use
IX_B_akey
:
SELECT A.a_time, B.*
FROM B
with (index(IX_B_akey))
join A on B.akey = A.id
where
A.a_time > '2017-01-13T01:30:00' and A.a_time < '2017-01-14T07:30:00'
and B.name in ('name33', 'name55', 'name66')
it runs in less than one second. Execution plan:here (see below for more exact execution plans)
This does not change when I manually update statistics
on both tables.
The query plan for the query without the hint shows the server will do a table
scan on B
, looking for matching name
s. It is no surprise that this will take a while. With the hint, it uses the index and does a lookup via index for the
B
records referencing matching A
records. This is much faster.
I do not want to put query optimizer code into my software. Also, I use NHibernate.
Although it is possible, it would be ugly to use NHibernate interceptors and edit its
SQL.
Maybe the optimizer does not know that all B
records referencing one A
record are
physically next to each other. They are next to each other because they have been inserted
at the same time. If they were scattered throughout the database, it may be more expensive
to do all the lookups.
Question: How do I help the optimizer to choose the fast plan without a hint in the query? Can I add specific statistics to help here? Do I need a stored query plan?
For reference, here are the DDL statements used to create the tables.
create table A (
id int not null identity(1,1),
a_time datetime,
constraint pkA primary key (id)
)
create table B (
id int not null identity(1,1),
akey int not null references A (id),
name nvarchar(50),
d decimal(5,3),
constraint pkB primary key (id)
)
create index IX_B_akey on B (akey)
create index IX_A_a_time on A (a_time)
Update: Adding name
to the index IX_B_Akey
would probably help, but it would also nearly double the data volume. This is not a good option.
Update on execution plans: After posting the question, I created another test scenario with the same data structure but more data. The queries are the same, but the date range which is queried has been extended. The database contains 1 mio records in A and 200 mio records in B. This allows me to privide actual execution plans:
Best Answer
Bad plans come from hard choices. Instead of making the optimizer choose between a plan with two nested loops joins and a plan with a big parallel hash join, you could reorganize B to optimize for the access path from A to B.
The best index here is probably to make the clustered PK of B (akey,id). Then there would be only one nested loops join in the already-faster plan, making it obviously better than the parallel hash join plan.