On the SQL Saturday event I've attended, one of the lecturers said it's a always good practice to exclude the clustered key columns from the non-clustered index key definition and include clause, too.
I disagree. Let me explain with the table in your question:
The clustered index is: (RecordID, QuestionID)
and there are many more columns.
Any non-clustered index will also have the clustered key columns, too, appended in the end. So, an index like:
(Pts) is equivalent to: (Pts, RecordID, QuestionID)
and similarly:
(Pts, PtsOf) <->: (Pts, PtsOf, RecordID, QuestionID)
(Pts, RecordId) <->: (Pts, RecordID, QuestionID)
(Pts, QuestionID) <->: (Pts, QuestionID, RecordID)
(Pts) INCLUDE (PtsOf) <->: (Pts, RecordID, QuestionID) INCLUDE (PtsOf)
(QuestionID, RecordID) <->: (QuestionID, RecordID)
For joint tables or like this one that have a composite primary/unique key - whether it is clustered or not, it's very often to have queries that need the (a,b)
index and others that will use better the (b,a)
index, sometimes queries that need both. So, one often needs both of these.
If the composite clustered key has more than two columns - say (a,b,c)
- it's often that you may need an index on (b,c,a)
or in (d,b)
and another on (e,c,a)
(which of course will be equivalent to (b,c,a)
, (d,b,a,c)
and (e,c,a,b)
respectively.) You can't just remove these columns from the definitions of the non-clustered keys because the column order will change.
The suggestion has one good point though. The clustered key columns can be removed from the INCLUDE
part. They are redundant, just noise there.
About the indexes in the question, a non-CI on (Pts, PtsOf)
is equivalent to (Pts, PtsOf, RecordID, QuestionID)
, so it is very different than the original non-CI on (RecordID, QuestionID) INCLUDE (Pts, PtsOf)
. It will use a bit more space than the original and of course these two indexes will be useful for different types of queries.
The (Pts, PtsOf)
will be be useful, for example, for queries with WHERE Pts BETWEEN 0 AND 100
, WHERE Pts = 200 AND PtsOf = 300
, etc.
The (RecordID, QuestionID) INCLUDE (Pts, PtsOf)
is basically a copy of the table with only the 2 clustered key columns and 2 only extra columns (of the many). This is (rarely) useful and it's a form of vertical partitioning. If you often have queries that need all the rows of the table but only these 2 columns, then it's probably one these (rather rare) cases where the extra space and effort to maintain this index is justified.
But you have an index on TweetID. Is that index not used by the joins?
Even a non-clustered index will fragment with insert and update. As that index fragments inserts take longer. How are you managing fill factor and index maintenance on this non-clustered index?
An unused PK does nothing.
TweetID as PK will save some space. What % of rows get added daily. If only 1% rows get added daily you could use a fill factor of 90% and defrag daily or weekly. Start with like 80% or 90% and see how fast it fragments.
If the TweetID are out of order but in the same range then the problem you have there is fill factor gets hammered in that range and unused in the rest of the range. But a defag is faster as it is only cleaning up that part of the range. In that case you may be better off with 100% fill factor and a regular index reorganize.
There is no cut and dry answer. You need to manage and monitor the index for your data and your workload.
Best Answer
Here is what I have seen based on several years of working with GUIDs as clustered, and nonclustered primary keys...there is no one correct answer here; the big thing really comes down to the access methods that you are going to use to get to the data, and the volume of data that you are going to return.
Yes, you will have the fragmentation problem with GUID clustered indexes that will have to be managed, however, if your access method uses that GUID as the query predicate, and you return multiple columns from that table, you may be better off suffering the fragmentation hit in order to improve the performance and lessen the impact of reading that data back out.
Here's a very basic example that show that there is less work required to get the data out when it's the clustered index, and the predicate used to get at the data:
The things that you need to take into account, and think about are going to be:
Sadly there is no single correct answer for this, you can only get the appropriate expectations, and performing extensive testing to see if those expectations can be met with whatever data model you design, and if not, look to make adjustments to improve them.
It is very difficult to say. The only real disadvantage is that, if most of your queries come in via the nonclustered PK, and you are very read heavy you are going to see a great deal of increased IO as you will be performing key lookups on every call.
If your storage can withstand that, then it's great. And on the upside, you don't have the fragmentation problem which can lead to index maintenance hell over time (and can mean that you can't rebuild these things if you are running AGs and don't have maintenance periods).