Sql-server – Which columns to place the clustered index

clustered-indexsql serversql-server-2012

I have a system that has been up-and-running for almost 3 years now. Previously (and currently) the front-end was using NHibernate for all of the database access.

Right now, I'm in the process of converting over to Dapper and utilizing stored procedures for EVERYTHING (reads, writes, etc).

Now that NHibernate is gone, I don't believe the current structure of tablename, tablenameid (clustered primary key) is optimal anymore for every table. Lots of my tables will NEVER be accessed via the primary key, UNLESS it's for a delete (even that can be avoided).

So, here are a couple examples of table structures, and which fields I believe should be the clustered index.

Example 1:

Event table
------------------------------------------
EventId int primary key clustered,
Season int not null,
EventDate datetime not null,
EventName varchar(100) not null,
EventType varchar(10) not null,-- (soon to be it's own table, but not right now...there are only 2 possible values, Tournament or Dual)
SchoolId int not null,
OpponentId int NULL,
etc...

So currently, I have the clustered index on the primary key. This is one table that is accessed 99% of the time by Season, and SchoolId. (Very rare that the EventId is used, when it is, it's for a delete)

Here's where the trickiness comes in. If the EventType is 'Dual', then uniqueness is by Season, EventDate, SchoolId, OpponentId. If the EventType is 'Tournament', Then uniqueness is on Season, EventDate, SchoolId, EventName (OpponentId will be null).

Under this architecture, I don't believe I can/should have a unique key, which isn't really my issue…

Am I safe to assume that the clustered index for this table should be on the Season and SchoolId columns?

Example 2:

WrestlerRanking table
-----------------------------
WrestlerRankingId int primary key clustered,
Season int not null,
Week int not null,
IsCurrent bit not null,
WrestlerId int not null,
etc...

In this scenario, similar to the one above, 99% of the this table is accessed by Season, WrestlerId, and IsCurrent. Uniqueness can be set by Season, Week, and WrestlerId.

Would/should the clustered index for this table be on Season and WrestlerId, even though the majority of joins to this table will include the IsCurrent column? This is definitely not unique as there could be around 20 records for the Season and WrestlerId combination (20 different weeks of rankings).

Best Answer

Here's where the trickiness comes in. If the EventType is 'Dual', then uniqueness >is by Season, EventDate, SchoolId, OpponentId. If the EventType is 'Tournament', >Then uniqueness is on Season, EventDate, SchoolId, EventName (OpponentId will be >null).

I'd look at what's common between these two queries to determine what columns to use in the Clustered Index. Remember, the Clustered index is effectively how the data is arranged on the disk, so it should have a structure similar to the hierarchy that queries against the table will reference. If every query, or the most critical ones, all look at Season, SchoolId and EventDate regardless of EventType, then I'd use those in the Clustered index. Additional Non-clustered covering indexes can be created for specific queries (around EventType or OpponentId, for example) which INCLUDE the columns returned to prevent key lookups.

Clustered indexes do not need to be unique, they need to specify how the data should be arranged so that queries can quickly seek to the desired location. The primary key should be unique, but it doesn't have to be a clustered index. If it's just a reference for a row value that doesn't have any useful information that queries will use (i.e. you never join on it), then it should not be part of the clustered index.

For the second example, I would probably use Season, Week, and WrestlerId if those are most queried. The order should be in the logical heirarchy that the data is queried in...which brings me to my next point:

All this advice is contingent on the execution plans of your queries.

Always, ALWAYS, use actual execution information to determine what the best indexes for your queries are. Got some table scan issues? Implement a clustered index. Got a Clustered index scan slowing you down? Look into implementing a non-clustered covering index for that query based on the columns it's looking up (often suggested by the DB engine itself). Got key lookup? Try using an INCLUDES on the affected index with the looked-up columns. Indexing hypothetically, before you're super familiar with the DB engine, often leads to wasted time and rework. I'm not saying to shut your brain off and just blindly follow the query plan's suggested indexes, but instead use the query plan to build your indexing strategy rather than trying to preempt it, or second guess it.