Sql-server – Clustered Index fragmentation vs Index with Included columns fragmentation

clustered-indexindexindex-tuningperformancesql server

I have a relative large table (for me) with 40 million rows that is expected to grow to 80 to 120 million rows in a period of two weeks (during a campaign).

Tip
--------------
Id          int (clustered index)
UserId      int
TipIndex    smallint
Value       binary(8)
LastChanged datetime2(3)
  • every user has between 1 to 400 tips that will be inserted at random during this period
  • UserId + TipIndex is unique
  • I never query Tips directly on the Identity Key (Id)
  • 99% of the times I query on UserId
  • I need all columns
  • I often query on 1 UserId (per page view), sometimes on a batches of 10.000s for statistics
  • this is a high traffic site during this period and it should be able to handle 30.000 queries on UserId per minute
  • Id is at the moment my Clustered Index, because I read that it leads to the least fragmentation.

So I'm hosting on SQL Azure and Azure already recommended adding an Index with included columns.
I was always hesitant on using UserId,TipIndex as a Clustered Index, as Tips will be added at random. Meaning that I'm afraid of huge fragmentation problems etc.

My questions:

  • Doesn't an Index with included columns have the exact same problem?
  • Is a table with included columns not just the same as a "shadow table" with the same fragmentation problems?
  • Should I migrate to use UserId, TipIndex as a ClusteredIndex instead of Id?
  • How to prevent fragmentation?

I know in the end the answer is always "depends" or that I should measure it. But as I'm a solo developer and not with a lot of resources I am hoping on someone with more experience that has a gut feeling for this, so my first attempts have a higher chance of going in the right direction.

Best Answer

It seems you are too much concerned about fragmentation, As long as you keep updating statistics regularly, fragmentation shouldn't bother you much for performance. You may read more details about this on a video shared by Mr. Brent Ozar and also another page here. Let me try answering your question one by one:

Doesn't an Index with included columns have the exact same problem?

Index with columns in the include or be it a key column, doesn't differ much. Key columns are going to be part of B-tree whereas include columns are not however when you perform any insert/update/delete operation, this will have same expense as these columns need to be updated/inserted/deleted.

Is a table with included columns not just the same as a "shadow table" with the same fragmentation problems?

Not very sure what you meant by shadow table, if you meant by base table here then yes, you would have same problem as far as fragmentation is concerned.

Should I migrate to use UserId, TipIndex as a ClusteredIndex instead of Id?

As per your statement - "99% of the times I query on UserId", this is a good candidate for primary key clustered column. Since you are not going to use Id column very frequently, I don't see any problem in using composite primary clustered key in the form of UserId and TipIndex. In terms of size of index, it is as good as Id(int --> 4 byte) plus an additional column of tinyint type(1 byte).

Kindly understand that, clustered key is nothing but the order of data stored logically and doesn't have any physical existence unlike non-clustered key.

How to prevent fragmentation?

I would say updating statistics should be considered with priority than that of fragmentation. You may use maintenance script used by many DBAs across the world from Ola Hallengren. You can schedule it weekly or bi-weekly basis your requirement.

Hope this helps.