Sql-server – Is ‘Avoid creating a clustered index based on an incrementing key’ a theth from SQL Server 2000 days

clustered-indexsql serversql-server-2005

Our databases consist of lots of tables, most of them using an integer surrogate key as a primary key. About half of these primary keys are on identity columns.

The database development started in the days of SQL Server 6.0.

One of the rules followed from the beginning was, Avoid creating a clustered index based on an incrementing key, as you find in these Index Optimization Tips.

Now using SQL Server 2005 and SQL Server 2008, I have the strong impression that the circumstances changed. Meanwhile, these primary key columns are perfect first candidates for the clustered index of the table.

Best Answer

The myth goes back to before SQL Server 6.5, which added row level locking. And hinted at here by Kalen Delaney.

It was to do with "hot spots" of data page usage and the fact that a whole 2k page (SQL Server 7 and higher use 8k pages) was locked, rather then an inserted row Edit, Feb 2012

Found authoritative article by Kimberly L. Tripp

"The Clustered Index Debate Continues..."

Hotspots were something that we greatly tried to avoid PRIOR to SQL Server 7.0 because of page level locking (and this is where the term hot spot became a negative term). In fact, it doesn't have to be a negative term. However, since the storage engine was rearchitected/redesigned (in SQL Server 7.0) and now includes true row level locking, this motivation (to avoid hotspots) is no longer there.

Edit, May 2013

The link in lucky7_2000's answer seems to say that hotspots can exist and they cause issues. However, the article uses a non-unique clustered index on TranTime. This requires a uniquifier to be added. Which means the index in not strictly monotonically increasing (and too wide). The link in that answer does not contradict this answer or my links

On a personal level, I have woked on databases where I inserted tens of thousands of rows per second into a table that has a bigint IDENTITY column as the clustered PK.

Related Solutions

Sql-server – SQL Server Primary key / clustered index design decision

You are correct to separate "clustered index" from "primary key":

A clustered index is the organisation of data on disk is better if
- narrow
- numeric
- increasing (strictly monotonic)
The primary key identifies a row

Note: GUIDs make poor clustering keys

In this case, with the surrogate column, the table has 2 candidate keys:

ProductHistoryID
ProductNo + CreatedDateTime

Assumed convention states that the ProductHistoryID becomes the PK, but you can leave the PK on (ProductNo, CreatedDateTime): it will just be non-clustered. Which leads to indexes:

clustered index should be on ProductHistoryID
unique non-clustered index on (ProductNo, CreatedDateTime)

Example

CREATE TABLE Product (
    ProductHistoryID int NOT NULL IDENTITY (1,1) NOT NULL,
    ProductNo ...
    CreatedDateTime ...

then you a choice of

    CONSTRAINT PK_Product PRIMARY KEY CLUSTERED (ProductHistoryID)
    CONSTRAINT UQ_Product UNIQUE NONCLUSTERED (ProductHistoryID)

    CONSTRAINT PK_Product PRIMARY KEY NONCLUSTERED (ProductNo, CreatedDateTime)
    CONSTRAINT PK_Product UNIQUE CLUSTERED (ProductHistoryID)

Also, the pattern you have is a "type 2 Slowly Changing Dimension"

Sql-server – SQL Server clustered index high fragmentation

FILLFACTOR only applies when you build or rebuild an index, not during normal operation. Normal operations always try to fill the pages to 100%.

If you insert a row that has a variable width, then update the row to be longer, that row will no longer fit on the page if there isn't enough extra space to store the after-image on the same page. If there isn't enough space, this will cause a page split, which is the process that creates the necessary space.

What's a bit misleading about page splits is that there are "good splits" and "bad splits," even though the performance counter counts all of them.

A good split is when new rows are added to the end of the index, like what happens when you do your initial batch of INSERTs. The new row doesn't fit on the last page, so the storage engine must allocate a new page and logically hook it up to the last index page. The new page will probably exist physically after the old-last index page.
A bad split is when a page must be inserted in the middle of an index: the new page is attached to the index structure logically on both sides, but may (likely) not exist in contiguous physical order to those pages.

Fragmentation is the discrepancy between the logical and physical order, and for the most part, only the bad type of page splits cause fragmentation.

Because you're inflating the size of already-existing rows, this causes the bad type of page split, which is why you're seeing high fragmentation numbers.

It's unclear what your exact process is, and how many rows are ultimately going to end up in this table. If it's something like a one-time population, do the full population process, then turn around and rebuild the clustered index with 100% FILLFACTOR.

If this process happens continuously, you could "pre-allocate" the space by doing the following: add a dummy variable-width column to the table, populate it to the max length on INSERT, and then on the first UPDATE, set the dummy value to NULL while updating the real value. This method will probably add overhead to the logging mechanism because of all the data values flying around.

In general, though, you only need to concern yourself with fragmentation when all of these factors are true:

The index is large enough that randomly reading the pages (extents) is too slow (choose your own definition of what "too slow" means)
The index will be scanned
The index pages (extents) will be physically read from disk (i.e., they are not already in the buffer pool)
The storage subsystem is poor at handling random reads (i.e., spindle-based, and the bytes won't come out of cache)

I would recommend, however, putting in place a solid index maintenance solution to keep things more or less in check. You can search this site for recommendations with regards to that.

Best Answer

Related Solutions

Sql-server – SQL Server Primary key / clustered index design decision

Sql-server – SQL Server clustered index high fragmentation

Related Question