Sql-server – Is the WHERE-JOIN-ORDER-(SELECT) rule for index column order wrong

indexindex-designindex-tuningsql serversql-server-2008-r2

I am trying to improve this (sub-) query being part of a larger query:

select SUM(isnull(IP.Q, 0)) as Q, 
        IP.OPID 
    from IP
        inner join I
        on I.ID = IP.IID
    where 
        IP.Deleted=0 and
        (I.Status > 0 AND I.Status <= 19) 
    group by IP.OPID

Sentry Plan Explorer pointed out some relatively expensive Key Lookups for table dbo.[I] performed by the query above.

Table dbo.I

    CREATE TABLE [dbo].[I] (
  [ID]  UNIQUEIDENTIFIER NOT NULL,
  [OID]  UNIQUEIDENTIFIER NOT NULL,
  []  UNIQUEIDENTIFIER NOT NULL,
  []  UNIQUEIDENTIFIER NOT NULL,
  [] UNIQUEIDENTIFIER NULL,
  []  UNIQUEIDENTIFIER NOT NULL,
  []  CHAR (3) NOT NULL,
  []  CHAR (3)  DEFAULT ('EUR') NOT NULL,
  []  DECIMAL (18, 8) DEFAULT ((1)) NOT NULL,
  [] CHAR (10)  NOT NULL,
  []  DECIMAL (18, 8) DEFAULT ((1)) NOT NULL,
  []  DATETIME  DEFAULT (getdate()) NOT NULL,
  []  VARCHAR (35) NULL,
  [] NVARCHAR (100) NOT NULL,
  []  NVARCHAR (100) NULL,
  []  NTEXT  NULL,
  []  NTEXT  NULL,
  []  NTEXT  NULL,
  []  NTEXT  NULL,
  [Status]  INT DEFAULT ((0)) NOT NULL,
  []  DECIMAL (18, 2)  NOT NULL,
  [] DECIMAL (18, 2)  NOT NULL,
  [] DECIMAL (18, 2)  NOT NULL,
  [] DATETIME DEFAULT (getdate()) NULL,
  []  DATETIME NULL,
  []  NTEXT  NULL,
  []  NTEXT  NULL,
  [] TINYINT  DEFAULT ((0)) NOT NULL,
  []  DATETIME NULL,
  []  VARCHAR (50) NULL,
  []  DATETIME  DEFAULT (getdate()) NOT NULL,
  []  VARCHAR (50) NOT NULL,
  []  DATETIME NULL,
  []  VARCHAR (50) NULL,
  []  ROWVERSION NOT NULL,
  []  DATETIME NULL,
  []  INT  NULL,
  [] TINYINT DEFAULT ((0)) NOT NULL,
  [] UNIQUEIDENTIFIER NULL,
  []  TINYINT DEFAULT ((0)) NOT NULL,
  []  TINYINT  DEFAULT ((0)) NOT NULL,
  [] NVARCHAR (50)  NULL,
  [] TINYINT DEFAULT ((0)) NOT NULL,
  []  UNIQUEIDENTIFIER NULL,
  [] UNIQUEIDENTIFIER NULL,
  []  TINYINT  DEFAULT ((0)) NOT NULL,
  []  TINYINT DEFAULT ((0)) NOT NULL,
  []  UNIQUEIDENTIFIER NULL,
  []  DECIMAL (18, 2)  NULL,
  []  DECIMAL (18, 2)  NULL,
  [] DECIMAL (18, 2)  DEFAULT ((0)) NOT NULL,
  [] UNIQUEIDENTIFIER NULL,
  [] DATETIME NULL,
  [] DATETIME NULL,
  []  VARCHAR (35) NULL,
  [] DECIMAL (18, 2)  DEFAULT ((0)) NOT NULL,
  CONSTRAINT [PK_I] PRIMARY KEY NONCLUSTERED ([ID] ASC) WITH (FILLFACTOR = 90),
  CONSTRAINT [FK_I_O] FOREIGN KEY ([OID]) REFERENCES [dbo].[O] ([ID]),
  CONSTRAINT [FK_I_Status] FOREIGN KEY ([Status]) REFERENCES [dbo].[T_Status] ([Status])
);                  


GO
CREATE CLUSTERED INDEX [CIX_Invoice]
  ON [dbo].[I]([OID] ASC) WITH (FILLFACTOR = 90);

Table dbo.IP

CREATE TABLE [dbo].[IP] (
 [ID] UNIQUEIDENTIFIER DEFAULT (newid()) NOT NULL,
 [IID] UNIQUEIDENTIFIER NOT NULL,
 [OID] UNIQUEIDENTIFIER NOT NULL,
 [Deleted] TINYINT DEFAULT ((0)) NOT NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 []UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] INT NOT NULL,
 [] VARCHAR (35) NULL,
 [] NVARCHAR (100) NOT NULL,
 [] NTEXT NULL,
 [] DECIMAL (18, 4) DEFAULT ((0)) NOT NULL,
 [] NTEXT NULL,
 [] NTEXT NULL,
 [] DECIMAL (18, 4) DEFAULT ((0)) NOT NULL,
 [] DECIMAL (18, 4) DEFAULT ((0)) NOT NULL,
 [] DECIMAL (4, 2) NOT NULL,
 [] INT DEFAULT ((1)) NOT NULL,
 [] DATETIME DEFAULT (getdate()) NOT NULL,
 [] VARCHAR (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
 [] DATETIME NULL,
 [] VARCHAR (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
 [] ROWVERSION NOT NULL,
 [] INT DEFAULT ((1)) NOT NULL,
 [] DATETIME NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] DECIMAL (18, 4) DEFAULT ((1)) NOT NULL,
 [] DECIMAL (18, 4) DEFAULT ((1)) NOT NULL,
 [] INT DEFAULT ((0)) NOT NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 []UNIQUEIDENTIFIER NULL,
 []NVARCHAR (35) NULL,
 [] VARCHAR (35) NULL,
 [] NVARCHAR (35) NULL,
 [] NVARCHAR (35) NULL,
 [] NVARCHAR (35) NULL,
 [] NVARCHAR (35) NULL,
 [] NVARCHAR (35) NULL,
 [] NVARCHAR (35) NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] VARCHAR (12) NULL,
 [] VARCHAR (4) NULL,
 [] NVARCHAR (50) NULL,
 [] NVARCHAR (50) NULL,
 [] VARCHAR (35) NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] TINYINT DEFAULT ((0)) NOT NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] TINYINT DEFAULT ((0)) NOT NULL,
 [] TINYINT DEFAULT ((0)) NOT NULL,
 [] NVARCHAR (50) NULL,
 [] TINYINT DEFAULT ((0)) NOT NULL,
 [] DECIMAL (18, 2) NULL,
 []TINYINT DEFAULT ((1)) NOT NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] UNIQUEIDENTIFIER NULL,
 [] TINYINT DEFAULT ((1)) NOT NULL,
 CONSTRAINT [PK_IP] PRIMARY KEY NONCLUSTERED ([ID] ASC) WITH (FILLFACTOR = 90),
 CONSTRAINT [FK_IP_I] FOREIGN KEY ([IID]) REFERENCES [dbo].[I] ([ID]) ON DELETE CASCADE NOT FOR REPLICATION,
 CONSTRAINT [FK_IP_XType] FOREIGN KEY ([XType]) REFERENCES [dbo].[xTYPE] ([Value]) NOT FOR REPLICATION
);

GO
CREATE CLUSTERED INDEX [IX_IP_CLUST]
 ON [dbo].[IP]([IID] ASC) WITH (FILLFACTOR = 90);

The table "I" has about 100,000 rows, clustered index has 9,386 pages.
The table IP is the "child" – table of I and has about 175,000 rows.

I tried to add a new index following the index column order rule: "WHERE-JOIN-ORDER-(SELECT)"

https://www.mssqltips.com/sqlservertutorial/3208/use-where-join-orderby-select-column-order-when-creating-indexes/

to address the key lookups and create an index seek:

CREATE NONCLUSTERED INDEX [IX_I_Status_1]
    ON [dbo].[Invoice]([Status], [ID])

The extracted query immediately used this index. But the original larger query it is part of, didn't. It did not even use it when I forced it to using WITH(INDEX(IX_I_Status_1)).

After a while I decided to try another new index and changed to order of the indexed columns:

CREATE NONCLUSTERED INDEX [IX_I_Status_2]
    ON [dbo].[Invoice]([ID], [Status])

WOHA! This index was used by the extracted query and also by the larger query!

Then I compared the extracted queries IO statistics by forcing it to use [IX_I_Status_1] and [IX_I_Status_2]:

Results [IX_I_Status_1]:

Table 'I'. Scan count 5, logical reads 636, physical reads 16, read-ahead reads 574
Table 'IP'. Scan count 5, logical reads 1134, physical reads 11, read-ahead reads 1040
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0

Results [IX_I_Status_2]:

Table 'I'. Scan count 1, logical reads 615, physical reads 6, read-ahead reads 631
Table 'IP'. Scan count 1, logical reads 1024, physical reads 5, read-ahead reads 1040
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0,  read-ahead reads 0

OK, I could understand that the mega-large-monster query maybe just too complex to make SQL server catch the ideal execution plan and may miss my new index.
But I don't understand why the index [IX_I_Status_2] seems to be more suitable and more efficient for the query.

Since the query first of all filters table I by column STATUS and then joins with table IP, I don't understand why [IX_I_Status_2] is better and used by Sql Server instead of [IX_I_Status_1]?

Best Answer

Is the WHERE-JOIN-ORDER-(SELECT) rule for index column order wrong?

At the least it is incomplete and potentially misleading advice (I didn't bother to read the whole article). If you're going to read stuff on the Internet (including this), you should adjust your amount of trust according to how well you already know and trust the author, but always then verify for yourself.

There are a number of "rules of thumb" for creating indexes, depending on the exact scenario, but none are really a good substitute for understanding the core issues for yourself. Read up on the implementation of indexes and execution plan operators in SQL Server, go through some exercises, and come to a good solid understanding of how indexes can be used to make execution plans more efficient. There is no effective shortcut to attaining this knowledge and experience.

In general, I can say that your indexes should most often have columns used for equality tests first, with any inequalities last, and/or provided by a filter on the index. This is not a complete statement, because indexes can also provide order, which may be more useful than seeking directly to one or more keys in some situations. For example, ordering can be used to avoid a sort, to reduce the cost of a physical join option like merge join, to enable a stream aggregate, find the first few qualifying rows quickly...and so on.

I'm being a little vague here, because selecting the ideal index(es) for a query depends on so many factors - this is a very broad topic.

Anyway, it is not unusual to find conflicting signals for the 'best' indexes in a query. For example, your join predicate would like rows ordered one way for a merge join, the group by would like rows sorted another way for a stream aggregate, and finding the qualifying rows using the where clause predicates would suggest other indexes.

The reason indexing is an art as well as science is that an ideal combination is not always logically possible. Choosing the best compromise indexes for the workload (not just a single query) requires analytic skills, experience, and system-specific knowledge. If it were easy, the automated tools would be perfect, and performance-tuning consultants would be much less in demand.

As far as missing index suggestions are concerned: these are opportunistic. The optimizer brings them to your attention when it tries to match predicates and required sort order to an index that does not exist. The suggestions are therefore based on particular matching attempts in the specific context of the particular sub-plan variation it was considering at the time.

In context, the suggestions always make sense, in terms of reducing the estimated cost of data access, according to the optimizer's model. It does not do a wider analysis of the query as a whole (much less the wider workload), so you should think of these suggestions as a gentle hint that a skilled person needs to look at the available indexes, with the suggestions as a starting point (and usually no more than that).

In your case, the (Status) INCLUDE (ID) suggestion probably came about when it was looking at the possibility of a hash or merge join (example later). In that narrow context, the suggestion makes sense. For the query as a whole, maybe not. The index (ID, Status) enables a nested loop join with ID as an outer reference: equality seek on ID and inequality on Status per iteration.

One possible selection of indexes is:

CREATE INDEX i1 ON dbo.I (ID, [Status]);
CREATE INDEX i1 ON dbo.IP (Deleted, OPID, IID) INCLUDE (Q);

...which produces a plan like:

Possible plan

I am not saying these indexes are optimal for you; they happen to work to produce a reasonable-looking plan to me, without being able to see statistics for the tables involved, or the full definitions and existing indexing. Also, I know nothing of the wider workload or real query.

Alternatively (just to show one of the myriad additional possibilities):

CREATE INDEX i1 ON dbo.I ([Status]) INCLUDE (ID);
CREATE INDEX i1 ON dbo.IP (Deleted, IID, OPID) INCLUDE (Q);

Gives:

Alternate plan

Execution plans were generated using SQL Sentry Plan Explorer.