Sql-server – When Azure recommends an index (part of natural key) with all other columns included, should I switch to a clustered index for this natural key

clustered-indexperformancesql server

The table does have an identity key (current CI), but it’s barely used to query. Because the natural key is not ever-increasing I’m afraid of insert performance, fragmentation or other problems that I don’t foresee now.

The table is not wide, with just a few columns. It has about 8 million rows and bringing our site to a halt during peak times. (+1000s concurrent users). The data is not easily cacheable, because it is quite volatile and essential that it’s up to date.

There are a lot of reads on one column of the natural key, but also quite active inserting and updating. Say 8 reads, vs 1 updates vs 1 inserts.

Id (PK)         int
UserId*         int
Key1*           varchar(25)
Key2*           varchar(25)
Key3*           int
LastChanged     datetime2(7)
Value           varchar(25)
Invalid         bit

* this combination is the natural primary key

I need to query the most of the time on:

  • All rows for one UserId (most queried)
  • All rows for a list of UserIds (a lot of rows)
  • All rows for a list of UserIds with Key1 = X
  • All rows for a list of UserIds with Key2 = X
  • All rows for a list of UserIds with Key1 = X and Key2 = X

I know the final answer is always “profile it”, but we are quite on a time constraint here so any guidance or experienced opinions in advance would really be appreciated.

Thanks in advance,

Best Answer

Unfortunately selecting a clustered index isn't just a matter of ever increasing. And insert performance and fragmentation aren't your only concerns. Here are a few things to think about:

Your identity column is ever increasing, your natural key isn't. This will increase your insert time, but inserts are generally pretty quick and in a OLTP system they aren't so much less frequent as less visible. For example doubling the time of a 4ms insert to 8ms and no one will notice. Increase the time of a select statement from 30 seconds to a minute and people will see it.

It's a similar problem with fragmentation. It exists, it can cause problems, but it's a manageable problem at worst.

That said, are these your only indexes? Because the clustered index keys are included in any non-clustered index. So a wider set of keys will actually increase the size of your non-clustered indexes and slow them down. And your identity column is only 4 bytes wide. Your natural key is 58 (assuming it's a unique key otherwise add 4 bytes for the uniquifier).

Ordered range scans can be very helpful, and are one of the big bonuses for the right clustered index. That said, I don't see either of the indexes you mention above as candidates for range scans. Those would be more for

SELECT * FROM table1 WHERE name BETWEEN 'Bob' and 'Mary'

You see them most with date data types, although queries similar to the above example happen too.

Also if you end up using your natural key as an NCI (non clustered index) then you might make sure it's a covering index for your main query(s). In other words it contains (generally by using the INCLUDE statement) any extra columns needed by the query. This means that SQL doesn't have to go back and check the CI (clustered index) and avoids an expensive lookup.

Last but certainly not least, do you have to worry about GDPR? Do you have any data for people who are in the European Union? If you do then you need to worry about natural keys in your scripts. If someone's information ends up in a script then you will need to include that in your plans when you have to remove someone's data.

Here is a summary I wrote a while back. It's more generic but might be helpful to look at:

https://sqlstudies.com/2016/08/29/natural-vs-artificial-primary-keys/

And a few other related posts:

In the end though, you are right. You should test it out. That said, I would personally guess that keeping the identity as your CI is going to be your best bet with the following 2 non clustered indexes (based on the queries you mentioned above):

  • UserId, Key1, Key2
  • UserId, Key2

I didn't include Key3 because you didn't mention it and it adds unnecessary width to your index. Beyond that, given the size of the table, you can include all of the non-key columns. Assuming that you have sufficient space for the indexes (which will be as big as the table is currently) it might improve your queries pretty significantly. Of course any data you don't have to send to your application shouldn't be sent and will improve your performance (and reduce the size of your index if you don't have to add it to the include).