SQL Server – Does Query Optimizer Prefer Constants Over Columns?

Tags: index, performance, query-performance, sql server

I think I have found the answer for this already but I am hoping to get some additional perspective.

Assume we are JOINing two tables together on a shared column, and each table also has a different column that we filter on with a constant. When we build an index to support the query, for each table do we want to put the JOINing column first, or the constant column first? I am now thinking it is the constant column first. When I look at the query plan for a different query that prompted this question, it appears the optimizer tries to create a subset of each table and then JOIN the subsets together, instead of JOINing the two tables together and filtering down from there.

EX: Joining Shipments to Customers where Shipment is Shipped and Customer is Active

SELECT [Columns]
FROM Shipment S
   INNER JOIN Customer C
      ON S.CustomerID = C.CustomerID
WHERE S.IsShipped = 1
AND C.IsActive = 1

I am thinking the two best indexes to use are the ones below, because the Query Optimizer would prefer to scan on the constant first and then JOIN on the second column, instead of JOINing the two tables together and filtering on the constant after that.

CREATE NONCLUSTERED INDEX [IX_IsActive-CustomerID] ON [dbo].[Customer]
(
    [IsActive] ASC,
    [CustomerID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_IsShipped-CustomerID] ON [dbo].[Shipment]
(
    [IsShipped] ASC,
    [CustomerID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Instead of:

CREATE NONCLUSTERED INDEX [IX_CustomerID-IsActive] ON [dbo].[Customer]
(
    [CustomerID] ASC,
    [IsActive] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_CustomerID-IsShipped] ON [dbo].[Shipment]
(
    [CustomerID] ASC,
    [IsShipped] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Is that right?

Best Answer

Assuming there are no other queries of concern, the general answer is that you want more selective criteria in the leading columns of the index (note that selectivity here is in relation to the predicate rather than a measure of how unique the particular values in the column are). In general, you want the optimizer to eliminate as many rows as possible as quickly as possible.
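One way to see how selective each constant predicate actually is: count the qualifying rows against the table totals. This is only a rough sketch; it assumes IsShipped and IsActive are bit-style columns as the question implies, and the column aliases are made up for illustration.

```sql
-- Fraction of Shipment rows that satisfy IsShipped = 1.
-- A small fraction means the predicate is selective.
SELECT
    COUNT(*) AS TotalRows,
    SUM(CASE WHEN IsShipped = 1 THEN 1 ELSE 0 END) AS MatchingRows,
    CAST(SUM(CASE WHEN IsShipped = 1 THEN 1 ELSE 0 END) AS float)
        / NULLIF(COUNT(*), 0) AS MatchingFraction   -- NULLIF avoids divide-by-zero on an empty table
FROM dbo.Shipment;

-- Same check for Customer rows that satisfy IsActive = 1.
SELECT
    COUNT(*) AS TotalRows,
    SUM(CASE WHEN IsActive = 1 THEN 1 ELSE 0 END) AS MatchingRows,
    CAST(SUM(CASE WHEN IsActive = 1 THEN 1 ELSE 0 END) AS float)
        / NULLIF(COUNT(*), 0) AS MatchingFraction
FROM dbo.Customer;
```

If nearly every shipment is shipped or nearly every customer is active, the corresponding predicate eliminates few rows and gains little from being the leading column.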

Assuming that CustomerID in Shipment is NOT NULL and has a foreign key to Customer, meaning that the inner join by itself is guaranteed not to eliminate any rows from Shipment, then S.IsShipped = 1 is the only selective predicate on that table and it would make sense for it to be the leading column. If, on the other hand, the inner join were more selective (imagine that someone moved rows from Customer to Customer_Archive periodically and Shipment could join to either of those tables), it would make sense to have CustomerID as the leading column.
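Rather than guessing, the two index layouts from the question can be compared directly. The sketch below assumes all four indexes from the question have been created, and it selects only CustomerID because the question's column list is not given. The index hints are there purely to force each alternative for testing; they are not a recommendation for production code.

```sql
SET STATISTICS IO ON;   -- report logical reads per table in the Messages tab

-- Variant 1: constant column leading.
SELECT S.CustomerID
FROM dbo.Shipment AS S WITH (INDEX ([IX_IsShipped-CustomerID]))
    INNER JOIN dbo.Customer AS C WITH (INDEX ([IX_IsActive-CustomerID]))
        ON S.CustomerID = C.CustomerID
WHERE S.IsShipped = 1
  AND C.IsActive = 1;

-- Variant 2: join column leading.
SELECT S.CustomerID
FROM dbo.Shipment AS S WITH (INDEX ([IX_CustomerID-IsShipped]))
    INNER JOIN dbo.Customer AS C WITH (INDEX ([IX_CustomerID-IsActive]))
        ON S.CustomerID = C.CustomerID
WHERE S.IsShipped = 1
  AND C.IsActive = 1;

SET STATISTICS IO OFF;
```

Comparing the logical reads and the actual execution plans of the two variants, and then running the query without hints to see which index the optimizer picks on its own, gives a data-driven answer for your particular distribution.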

Related Solutions

Mysql – inner join on PK with extra criteria slow despite indices

The optimizer does not see that your conditions are correlated and picks the wrong access method.

Basically, it considers two options:

Scan the index on siteVisitId until the first match on site_visits and the first satisfied timestamp condition.
Scan the index on timestamp until the first match on site_visits.

Since timestamp is a part of the primary key and siteVisitId is not, the second plan would involve table lookups on product_views, which is several times slower than a pure index scan (note Using index in the plan).

The optimizer calculates the conditional probability of the timestamp condition being satisfied (given that a corresponding site_visit record exists) and compares it to the overhead of the table access.

Since your timestamp condition is quite wide (as seen on the index histograms), the optimizer prefers the first method.

However, since both siteVisitId and timestamp are incremental, they are correlated and the conditional probability of both matches is not a mere product of their independent probabilities.

In simple words, you have to filter through many rows with low siteVisitId values before you find the first matching timestamp, which is exactly what is happening to your query.

You should add ORDER BY timestamp to your query to make the timestamp index cheaper as it won't have to sort. It would also help to create an index on timestamp, siteVisitId (in this order) to avoid table lookups.
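The suggested composite index could be sketched as below. This is MySQL syntax; the index name is hypothetical, and the table and column names are taken from the discussion above. The original query is not reproduced here, so only the index is shown.

```sql
-- MySQL sketch: timestamp leads, siteVisitId follows, so a range on
-- timestamp can be answered together with the join column from the
-- index alone, without extra table lookups on product_views.
-- The index name is an assumption chosen for illustration.
CREATE INDEX ix_product_views_timestamp_siteVisitId
    ON product_views (`timestamp`, siteVisitId);
```

After creating it, rerunning EXPLAIN on the query should show whether the optimizer now chooses this index and still reports Using index.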

Sql-server – Differences Between Two Different Create Index Commands

It boils down to looking at what the default values are. Let's break this down:

CREATE UNIQUE NONCLUSTERED INDEX [DEID_MAP_IDX1] ON [dbo].[DEID_MAP]

NONCLUSTERED is specified here. The default (i.e. when nothing is specified) is also nonclustered, so unless CLUSTERED is specified the index will be nonclustered. That is the same in both scripts.

[dbo] is specified here explicitly. For the second CREATE INDEX, where the schema is not specified, it all depends on what the current user's default schema is. Only you can answer that at the moment, so it may or may not default to dbo.
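To find out which schema an unqualified name would resolve to first, you can ask SQL Server for the caller's default schema. A minimal check, run under the same login that will run the second script:

```sql
-- With no argument, SCHEMA_NAME() returns the default schema of the caller.
SELECT SCHEMA_NAME() AS DefaultSchema;
```

If this returns dbo, the unqualified table name in the second script refers to the same dbo.DEID_MAP table as the first.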

WITH (
    PAD_INDEX  = OFF, 
    STATISTICS_NORECOMPUTE  = OFF, 
    IGNORE_DUP_KEY  = OFF, 
    ALLOW_ROW_LOCKS = ON, 
    ALLOW_PAGE_LOCKS = ON
) ON [PRIMARY]

PAD_INDEX: the default is OFF, so unspecified will be the same in the second script as it is in the first.

STATISTICS_NORECOMPUTE: the default is OFF, so the second script unspecified has the same value.

IGNORE_DUP_KEY: the default is OFF, so the second CREATE INDEX is identical with this parameter.

ALLOW_ROW_LOCKS: the default is ON, so the second CREATE script has the same behavior.

ALLOW_PAGE_LOCKS: the default is ON...the second script has identical behavior.

... ON [PRIMARY]: just like the default schema one, this all depends on what your default filegroup is. If PRIMARY is the default filegroup, your second CREATE INDEX script will also create the index on PRIMARY. If PRIMARY is not the default filegroup, then it will be a different filegroup, as an unspecified filegroup will go to the default filegroup.
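If you would rather verify than rely on the documented defaults, the catalog views expose every option discussed above. The query below is a sketch against the dbo.DEID_MAP table from the script; run it after creating the index with each script and compare the rows.

```sql
-- Which filegroup is currently the default?
SELECT name AS DefaultFilegroup
FROM sys.filegroups
WHERE is_default = 1;

-- Effective options and filegroup for each index on the table.
SELECT
    i.name             AS IndexName,
    i.type_desc        AS IndexType,        -- CLUSTERED / NONCLUSTERED
    i.is_unique        AS IsUnique,
    i.is_padded        AS PadIndex,         -- PAD_INDEX
    s.no_recompute     AS StatsNoRecompute, -- STATISTICS_NORECOMPUTE
    i.ignore_dup_key   AS IgnoreDupKey,     -- IGNORE_DUP_KEY
    i.allow_row_locks  AS AllowRowLocks,
    i.allow_page_locks AS AllowPageLocks,
    ds.name            AS DataSpaceName     -- filegroup the index lives on
FROM sys.indexes AS i
    INNER JOIN sys.stats AS s
        ON s.object_id = i.object_id
       AND s.stats_id  = i.index_id         -- index statistics share the index id
    INNER JOIN sys.data_spaces AS ds
        ON ds.data_space_id = i.data_space_id
WHERE i.object_id = OBJECT_ID(N'dbo.DEID_MAP');
```

Identical rows for both versions of the index mean the two CREATE INDEX commands are equivalent in your environment.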

All of this information, including the default values, can be found in the Books Online (BOL) reference for CREATE INDEX.
