Sql-server – Missing Non Clustered Index already part of Clustered Index

execution-planindexsql serversql-server-2016

I'm debugging a slow running query and in the execution plan a non-clustered index is suggested, with 51.6648 Impact. However, the non-clustered index only includes columns that are already in the Primary Key (PK) Composite Clustered Index.

Could this be because of the order of the columns in the index? i.e. if the columns in the clustered index are not in order from most selective to least then is there potential for a non-clustered index to improve performance?

In addition the non-clustered index only contains two of the three PK columns with the third added as an included column. Is the include another reason why use of the non-clustered index could be more optimal?

Below is an example of the table structures I am working with:

Tables-

Retailers (
    RetailerID int PK, 
    name ...)

Retailer_Relation_Types (
    RelationType smallint PK, 
    Description nvarchar(50) ...)

Retailer_Relations (
    RetailerID int PK FK, 
    RelatedRetailerID int PK FK, 
    RelationType smallint PK FK, 
    CreatedOn datetime ...)

The table Retailer_Relations has the following composite PK index and suggested index-

CONSTRAINT PK_Retailer_Relations 
PRIMARY KEY CLUSTERED (
    RetailerID ASC, 
    RelatedRetailerID ASC, 
    RelationType ASC
    ) ON [PRIMARY]

CREATE NONCLUSTERED INDEX <NameOfIndex> 
ON Retailer_Relations (
    RetailerID, 
    RelationType
    ) 
INCLUDE (
    RelatedRetailerID
    )

Best Answer

The table Retailer_Relations has the following composite PK index and suggested index-

While missing indexes could be helpful and could definitely work, I would not spend too much time on missing indexes, these hints are created on the estimated execution plan, not on the actual execution plan.

More precisely, these index hints are based on the premise of reducing the cost of Query Bucks™ used by operators in the plan. The optimizer calculates the estimated costs, and adds missing index hints accordingly.

As a result they could be very wrong. If you are unsure if it is going to help, the best thing to do is test the situation before and after. You could do this by adding the statement SET STATISTICS IO, TIME ON; before running the query.

Also, you could use statisticsparser to make it easier to read these statistics.

Could this be because of the order of the columns in the index?

That is correct, creating the missing index can improve the selectivity on queries, for example if your query looks like this:

SELECT  RelatedRetailerID
FROM Retailer_Relations 
WHERE
RetailerID = 5 AND
RelationType = 20;

or like this:

SELECT  RelatedRetailerID
FROM Retailer_Relations 
ORDER BY
RetailerID,
RelationType;

The reasoning behind this is that both indexes could seek on RetailerID, that part is not going to change. But what if extra filters/ordering is applied on RelationType? It would be all over the place in the clustered index, as a result of it being the third key value, not the second key value. And as we know, it is the second key value in the NCI.

Okay, but when or how would the nonclustered index improve the query?

A couple of cases could be:

If relationType filters a lot of values, the residual I/O could be high, resulting in the possible need of the nonclustered index (Query #1)
Ordering on the two columns occurs (One way), and the resultset is large (Query #2).
As @AaronBertrand mentioned: if the CI size difference compared to the NCI is of a considerable amount, adding the NCI will reduce the pages read by queries that benefit from it.

NCI Side note

As a side note, adding the key columns to the include list in your NCI is not exactly needed, since CI key columns are automatically included in all Non clustered indexes.

You could opt to do so if you are not sure if the clustered index will remain the same, and want the column to always be included.

Regarding the query itself, if you added the execution plan via PasteThePlan we could give some more information on indexing / improving the query.

Testing

Create table and add some rows

CREATE TABLE Retailer_Relations (
    RetailerID int , 
    RelatedRetailerID int , 
    RelationType smallint, 
    CreatedOn datetime,
    CONSTRAINT PK_Retailer_Relations 
PRIMARY KEY CLUSTERED (
    RetailerID ASC, 
    RelatedRetailerID ASC, 
    RelationType ASC
    ) ON [PRIMARY])


    DECLARE @I Int = 1
    WHILE @I < 1000
    BEGIN
    INSERT INTO Retailer_Relations(RetailerID,RelatedRetailerID,RelationType,CreatedOn)
    VALUES(@I,@I,@I,GETDATE()
    )
    set @I += 1
    END

Query #1

    SELECT  RelatedRetailerID
FROM Retailer_Relations 
WHERE
RetailerID = 5 AND
RelationType = 20;

Plan without index Here

While it is doing a seek, it is doing a seek on RetailerID. Afterwards it is issueing a residual I/O predicate on RelationType

Add the index

CREATE NONCLUSTERED INDEX IX_TEST
ON Retailer_Relations (
    RetailerID, 
    RelationType
    ) 
INCLUDE (
    RelatedRetailerID
    )

The residual predicate is gone, everything happens in a seek predicate, on both columns.

Execution plan

With the second query, the added index helpfullness becomes even more obvious:

SELECT  RelatedRetailerID
FROM Retailer_Relations 
ORDER BY
RetailerID,
RelationType;

Plan without the index, with a Sort operator:

Plan with the index, using the index removes the sort operator

Related Solutions

Sql-server – Is Primary Key Non Clustered recommended

In a general sense, the Clustered Index should be what is used most often to access and sort the data. A major factor in deciding what to use for a Clustered Index is that the key field(s) will be copied into all Nonclustered Indexes on that table.

The main concern for choosing a PK is finding the field(s) that uniquely identify a row, never being NULL, and preferably never changing. And given that this field will be copied into related table so that the child tables can be Foreign Keyed back to this PK, you generally want something smaller as any large field will have a multiplying affect on disk space. An additional consideration to keep in mind for string fields is the Collation since you are not going to want to JOIN on a field while doing a case-insensitive comparison, as that will be far less efficient than using either a case-sensitive collation or even better would be a binary collation (though the best option is still usually one of the integer types: TINYINT, SMALLINT, INT, or BIGINT).

That all being said, this question is nearly identical to this one:

Primary Key choice on table with unique identifier

And my answer would generally be the same.

BUT, there is one very important difference here: an index can only hold 900 bytes, hence your NVARCHAR(1000) is invalid for being indexed. Unless, of course, the data is never more than 450 characters, but in that case, the field should be NVARCHAR(450) instead of NVARCHAR(1000).

So, create an INT NOT NULL IDENTITY(1, 1) {TableName}ID field and use that as the Clustered PRIMARY KEY.

If you are curious about the total impact of using a large field in either a Clustered Index or Primary Key (given that those values are copied into Nonclustered Indexes and/or child tables), I wrote an article detailing the down-stream effects of this decision:

Disk Is Cheap! ORLY?

Sql-server – Excluding clustered key columns from non-clustered indexes definitions

On the SQL Saturday event I've attended, one of the lecturers said it's a always good practice to exclude the clustered key columns from the non-clustered index key definition and include clause, too.

I disagree. Let me explain with the table in your question:

The clustered index is: (RecordID, QuestionID) and there are many more columns.

Any non-clustered index will also have the clustered key columns, too, appended in the end. So, an index like:

(Pts)   is equivalent to:   (Pts, RecordID, QuestionID)

and similarly:

(Pts, PtsOf)           <->:  (Pts, PtsOf, RecordID, QuestionID)

(Pts, RecordId)        <->:  (Pts, RecordID, QuestionID)

(Pts, QuestionID)      <->:  (Pts, QuestionID, RecordID)

(Pts) INCLUDE (PtsOf)  <->:  (Pts, RecordID, QuestionID) INCLUDE (PtsOf) 

(QuestionID, RecordID) <->:  (QuestionID, RecordID)

For joint tables or like this one that have a composite primary/unique key - whether it is clustered or not, it's very often to have queries that need the (a,b) index and others that will use better the (b,a) index, sometimes queries that need both. So, one often needs both of these.

If the composite clustered key has more than two columns - say (a,b,c) - it's often that you may need an index on (b,c,a) or in (d,b) and another on (e,c,a) (which of course will be equivalent to (b,c,a), (d,b,a,c) and (e,c,a,b) respectively.) You can't just remove these columns from the definitions of the non-clustered keys because the column order will change.

The suggestion has one good point though. The clustered key columns can be removed from the INCLUDE part. They are redundant, just noise there.

About the indexes in the question, a non-CI on (Pts, PtsOf) is equivalent to (Pts, PtsOf, RecordID, QuestionID), so it is very different than the original non-CI on (RecordID, QuestionID) INCLUDE (Pts, PtsOf). It will use a bit more space than the original and of course these two indexes will be useful for different types of queries.

The (Pts, PtsOf) will be be useful, for example, for queries with WHERE Pts BETWEEN 0 AND 100, WHERE Pts = 200 AND PtsOf = 300, etc.
The (RecordID, QuestionID) INCLUDE (Pts, PtsOf) is basically a copy of the table with only the 2 clustered key columns and 2 only extra columns (of the many). This is (rarely) useful and it's a form of vertical partitioning. If you often have queries that need all the rows of the table but only these 2 columns, then it's probably one these (rather rare) cases where the extra space and effort to maintain this index is justified.

Best Answer

Related Solutions

Sql-server – Is Primary Key Non Clustered recommended

Sql-server – Excluding clustered key columns from non-clustered indexes definitions

Related Question