In a general sense, the Clustered Index should be what is used most often to access and sort the data. A major factor in deciding what to use for a Clustered Index is that the key field(s) will be copied into all Nonclustered Indexes on that table.
The main concern for choosing a PK is finding the field(s) that uniquely identify a row, never being NULL, and preferably never changing. And given that this field will be copied into related table so that the child tables can be Foreign Keyed back to this PK, you generally want something smaller as any large field will have a multiplying affect on disk space. An additional consideration to keep in mind for string fields is the Collation since you are not going to want to JOIN on a field while doing a case-insensitive comparison, as that will be far less efficient than using either a case-sensitive collation or even better would be a binary collation (though the best option is still usually one of the integer types: TINYINT
, SMALLINT
, INT
, or BIGINT
).
That all being said, this question is nearly identical to this one:
Primary Key choice on table with unique identifier
And my answer would generally be the same.
BUT, there is one very important difference here: an index can only hold 900 bytes, hence your NVARCHAR(1000)
is invalid for being indexed. Unless, of course, the data is never more than 450 characters, but in that case, the field should be NVARCHAR(450)
instead of NVARCHAR(1000)
.
So, create an INT NOT NULL IDENTITY(1, 1)
{TableName}ID field and use that as the Clustered PRIMARY KEY
.
If you are curious about the total impact of using a large field in either a Clustered Index or Primary Key (given that those values are copied into Nonclustered Indexes and/or child tables), I wrote an article detailing the down-stream effects of this decision:
Disk Is Cheap! ORLY?
On the SQL Saturday event I've attended, one of the lecturers said it's a always good practice to exclude the clustered key columns from the non-clustered index key definition and include clause, too.
I disagree. Let me explain with the table in your question:
The clustered index is: (RecordID, QuestionID)
and there are many more columns.
Any non-clustered index will also have the clustered key columns, too, appended in the end. So, an index like:
(Pts) is equivalent to: (Pts, RecordID, QuestionID)
and similarly:
(Pts, PtsOf) <->: (Pts, PtsOf, RecordID, QuestionID)
(Pts, RecordId) <->: (Pts, RecordID, QuestionID)
(Pts, QuestionID) <->: (Pts, QuestionID, RecordID)
(Pts) INCLUDE (PtsOf) <->: (Pts, RecordID, QuestionID) INCLUDE (PtsOf)
(QuestionID, RecordID) <->: (QuestionID, RecordID)
For joint tables or like this one that have a composite primary/unique key - whether it is clustered or not, it's very often to have queries that need the (a,b)
index and others that will use better the (b,a)
index, sometimes queries that need both. So, one often needs both of these.
If the composite clustered key has more than two columns - say (a,b,c)
- it's often that you may need an index on (b,c,a)
or in (d,b)
and another on (e,c,a)
(which of course will be equivalent to (b,c,a)
, (d,b,a,c)
and (e,c,a,b)
respectively.) You can't just remove these columns from the definitions of the non-clustered keys because the column order will change.
The suggestion has one good point though. The clustered key columns can be removed from the INCLUDE
part. They are redundant, just noise there.
About the indexes in the question, a non-CI on (Pts, PtsOf)
is equivalent to (Pts, PtsOf, RecordID, QuestionID)
, so it is very different than the original non-CI on (RecordID, QuestionID) INCLUDE (Pts, PtsOf)
. It will use a bit more space than the original and of course these two indexes will be useful for different types of queries.
The (Pts, PtsOf)
will be be useful, for example, for queries with WHERE Pts BETWEEN 0 AND 100
, WHERE Pts = 200 AND PtsOf = 300
, etc.
The (RecordID, QuestionID) INCLUDE (Pts, PtsOf)
is basically a copy of the table with only the 2 clustered key columns and 2 only extra columns (of the many). This is (rarely) useful and it's a form of vertical partitioning. If you often have queries that need all the rows of the table but only these 2 columns, then it's probably one these (rather rare) cases where the extra space and effort to maintain this index is justified.
Best Answer
While missing indexes could be helpful and could definitely work, I would not spend too much time on missing indexes, these hints are created on the estimated execution plan, not on the actual execution plan.
More precisely, these index hints are based on the premise of reducing the cost of Query Bucks™ used by operators in the plan. The optimizer calculates the estimated costs, and adds missing index hints accordingly.
As a result they could be very wrong. If you are unsure if it is going to help, the best thing to do is test the situation before and after. You could do this by adding the statement
SET STATISTICS IO, TIME ON;
before running the query.Also, you could use statisticsparser to make it easier to read these statistics.
That is correct, creating the missing index can improve the selectivity on queries, for example if your query looks like this:
or like this:
The reasoning behind this is that both indexes could seek on RetailerID, that part is not going to change. But what if extra filters/ordering is applied on RelationType? It would be all over the place in the clustered index, as a result of it being the third key value, not the second key value. And as we know, it is the second key value in the NCI.
Okay, but when or how would the nonclustered index improve the query?
A couple of cases could be:
NCI Side note
As a side note, adding the key columns to the include list in your NCI is not exactly needed, since CI key columns are automatically included in all Non clustered indexes.
You could opt to do so if you are not sure if the clustered index will remain the same, and want the column to always be included.
Regarding the query itself, if you added the execution plan via PasteThePlan we could give some more information on indexing / improving the query.
Testing
Create table and add some rows
Query #1
Plan without index Here
While it is doing a seek, it is doing a seek on RetailerID. Afterwards it is issueing a residual I/O predicate on RelationType
Add the index
The residual predicate is gone, everything happens in a seek predicate, on both columns.
Execution plan
With the second query, the added index helpfullness becomes even more obvious:
Plan without the index, with a Sort operator:
Plan with the index, using the index removes the sort operator