Sql-server – Should the index on an identity column be nonclustered

Tags: database-internals, heap, index-tuning, performance, performance-tuning, sql-server

For a table with an identity column, should the PK/unique index on that column be clustered or nonclustered?

The reason for asking is that other indexes will be created to support queries. Will a query that uses a nonclustered index (on a heap) and returns columns not covered by that index use less logical I/O (LIO), because there are no extra clustered index B-tree seek steps?

create table T (
  Id int identity(1,1) primary key, -- clustered or non-clustered? (surrogate key, may be used to join another table)
  A .... -- A, B, C have mixed data type of int, date, varchar, float, money, ....
  B ....
  C ....
  ....)

create index ix_A on T (A)
create index ix_..... -- Many indexes can be created for queries

-- Common query is query on A, B, C, ....
select A, B 
from T 
where A between @a and @a+5 -- This query will have less LIO if the PK is non-clustered (seek)

select A, B, C
from T 
where B between @a and @a+5 

....

Clustered PK on identity column is good because:

It increases monotonically, so inserts cause no page splits. It's said that a bulk insert can be as fast as on a heap (nonclustered) table
It's narrow

However, will the queries in the question be faster without setting it clustered?

**Update:**
What if the Id is the FK of other tables and it will be joined in some queries?

Best Answer

By default the PK is clustered, and in most cases this is fine. However, which of these questions should really be asked?

should my PK be clustered?
which column(s) will be the best key for my clustered index?

A PK and a clustered index are two different things:

A PK is a constraint. It is used to uniquely identify rows, but it carries no notion of storage. However, by default (in T-SQL and in SSMS), it is enforced by a unique clustered index if the table does not already have a clustered index.
A clustered index is a special type of index that stores the row data at the leaf level, meaning it is always covering. All columns, whether they are part of the key or not, are stored at the leaf level. It does not have to be unique, in which case a uniquifier (4 bytes) is added to the clustered key.
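
As a minimal sketch of how the two choices can be made separately, consider the following. The table dbo.Orders and its columns are hypothetical and are not taken from the question; they only illustrate the syntax.

```sql
-- Hypothetical table: names and types are assumptions for illustration only.
CREATE TABLE dbo.Orders
(
    OrderId   int IDENTITY(1,1) NOT NULL,
    OrderDate date  NOT NULL,
    Amount    money NOT NULL,
    -- The PK is only a constraint; here it is enforced by a nonclustered unique index.
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderId)
);

-- The clustered index decides how the rows are stored. It is not declared UNIQUE,
-- so SQL Server adds a hidden uniquifier to rows with duplicate OrderDate values.
CREATE CLUSTERED INDEX CIX_Orders_OrderDate ON dbo.Orders (OrderDate);
```

Whether such a split is worthwhile depends on the workload; the sketch only shows that the PK and the clustered index do not have to be the same index.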

Now we end up with 2 questions:

How do I want to uniquely identify rows in my table (PK)?
How do I want to store the rows at the leaf level of an index (clustered index)?

It depends on how:

you design your data model
you query your data and you write your queries
you insert or update your data
...

First, do you need a clustered index? If you bulk insert, it is more efficient to store unordered data in a heap (versus ordered data in a clustered index). A heap uses a RID (row identifier, 8 bytes) to uniquely identify each row and locate it on its page.
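
To check whether an existing table is currently a heap or has a clustered index, a query along these lines can be used. It assumes the question's table T lives in the dbo schema.

```sql
-- A heap appears in sys.indexes with index_id = 0 and type_desc = 'HEAP';
-- a clustered index has index_id = 1 and type_desc = 'CLUSTERED'.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.index_id,
       i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.T')   -- assumed schema and table name
  AND i.index_id IN (0, 1);
```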

The clustered index key should not be a random value. The data at the leaf level is stored and ordered by the index key, so the key should grow continuously in order to avoid fragmentation and page splits. If the PK cannot provide this, you should consider another key as the clustered candidate. A clustered index on an identity column, a sequential GUID, or even something like the insertion date is fine from a sequential point of view, since all new rows are added to the last leaf page. On the other hand, while random unique identifiers (for example from NEWID()) may suit your business needs as a PK, they should not be clustered because they are generated in random order.

If, after some data and query analysis, you find that you mostly use the same nonclustered index to locate your data before doing a key lookup in the clustered PK, you may consider making that index the clustered index, even though it may not uniquely identify your rows.

The clustered index key is composed of all the columns you want to index. A uniquifier column (4 bytes) is added if the index is not unique (an incremental value for duplicates, null otherwise). This index key is then stored once for each row at the leaf level of all your nonclustered indexes. Some key values are also stored several times at the intermediate (branch) levels between the root and the leaf level of the index tree (B-tree). If the key is too large, all the nonclustered indexes get larger and require more storage, I/O, CPU, memory, and so on. If you have a PK on name+birthdate+country, it is very likely not a good candidate: it is too large for a clustered index. A uniqueidentifier generated with NEWSEQUENTIALID() is usually not considered a narrow key (16 bytes), although it is sequential.
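
The width difference can be sketched with two hypothetical tables (dbo.KeyInt and dbo.KeyGuid are made-up names, as are their columns). Both keys are sequential, but one is 4 bytes and the other 16 bytes.

```sql
-- Hypothetical tables comparing two sequential clustered key choices.
CREATE TABLE dbo.KeyInt
(
    Id      int IDENTITY(1,1) NOT NULL,   -- 4-byte key, ever-increasing
    Payload varchar(100) NULL,
    CONSTRAINT PK_KeyInt PRIMARY KEY CLUSTERED (Id)
);

CREATE TABLE dbo.KeyGuid
(
    -- 16-byte key. NEWSEQUENTIALID() is only allowed in a DEFAULT constraint.
    Id      uniqueidentifier NOT NULL
            CONSTRAINT DF_KeyGuid_Id DEFAULT NEWSEQUENTIALID(),
    Payload varchar(100) NULL,
    CONSTRAINT PK_KeyGuid PRIMARY KEY CLUSTERED (Id)
);

-- Every nonclustered index row carries the clustered key:
-- 4 bytes per row for dbo.KeyInt versus 16 bytes per row for dbo.KeyGuid.
CREATE NONCLUSTERED INDEX IX_KeyInt_Payload  ON dbo.KeyInt  (Payload);
CREATE NONCLUSTERED INDEX IX_KeyGuid_Payload ON dbo.KeyGuid (Payload);
```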

Then, once you have figured out how to uniquely identify rows in your table, you can add a PK. If you think you won't use it in your queries, don't create it as clustered. You can still create another nonclustered index if you sometimes need to query by it. Note that the PK automatically creates a unique index.

The nonclustered indexes will always contain the clustered key. However, if the indexed columns (plus the key columns) cover the query, there won't be any key lookup in the clustered index. Don't forget that you can also add INCLUDE columns and a WHERE filter to a nonclustered index (use them wisely).
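
A hedged sketch of both options against the question's table T (assumed here to be in the dbo schema). The index names and the filter predicate are made up for illustration, and the column types are unknown, so treat this as syntax only.

```sql
-- Covers "SELECT A, B FROM T WHERE A BETWEEN ...": B is stored at the leaf level
-- of this index, so no lookup into the clustered index (or heap) is needed.
CREATE NONCLUSTERED INDEX ix_A_incl_B
    ON dbo.T (A)
    INCLUDE (B);

-- Filtered index: only rows that satisfy the WHERE predicate are indexed.
-- The predicate below is an arbitrary example, not a recommendation.
CREATE NONCLUSTERED INDEX ix_B_filtered
    ON dbo.T (B)
    INCLUDE (A, C)
    WHERE B IS NOT NULL;
```

Included columns make the index wider and add write overhead, which is why they should be chosen with the actual queries in mind.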

The clustered index key should be unique and as narrow as possible. It should not change over time, and rows should be inserted in increasing key order.

It is now time to write some SQL which will create the table, clustered and nonclustered indexes and constraints.
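
One way to try this out is sketched below. The column types are assumptions (the question does not give them), and the table would need to be loaded with representative data before the numbers mean anything. Running it once with a clustered PK and once with a nonclustered PK lets you compare the logical reads reported for the same query.

```sql
-- Hypothetical version of T: the column types are assumptions.
CREATE TABLE dbo.T
(
    Id int IDENTITY(1,1) NOT NULL,
    A  int         NOT NULL,
    B  date        NOT NULL,
    C  varchar(50) NULL,
    CONSTRAINT PK_T PRIMARY KEY CLUSTERED (Id)  -- change to NONCLUSTERED to test a heap
);

CREATE NONCLUSTERED INDEX ix_A ON dbo.T (A);

-- Load representative data here before measuring.

-- Report logical reads per statement in the Messages output.
SET STATISTICS IO ON;

DECLARE @a int = 100;   -- arbitrary test value

SELECT A, B
FROM dbo.T
WHERE A BETWEEN @a AND @a + 5;

SET STATISTICS IO OFF;
```

Checking the actual execution plan alongside the read counts shows whether the optimizer chose a seek plus lookup or a scan in each variant.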

This is all theoretical because we don't know your data model or the data types used (A and B).

Related Solutions

Sql-server – Would a table benefit if it was a heap

This started as a comment/question, but it got too long so I moved it here:

I'm really thrown by this question. 1.5 million rows isn't really all that big. And the point behind an identity is that it's ever increasing. If that's your clustered index, you shouldn't be doing inserts into the middle of a page, certainly not often enough to cause the level of fragmentation you're seeing.

Couple of questions:

Are you using IDENTITY_INSERT, that is, specifying what the identity value should be? Or have you reseeded the identity at some point so that you are inserting into the middle of the range?

Typically if you are doing inserts it looks like this:

5 6 7 8 < Next insert goes here >

But if you have something like this (assume your next identity value is 4)

 1 2 3 < Next insert goes here > 100 101

Then you could be seeing quite a few page splits. But in the normal course of things you shouldn't be.
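
A quick way to see whether the identity value has fallen behind the existing data is sketched below. dbo.YourTable and Id are placeholder names, since the table in question is not shown.

```sql
-- Report the current identity value and the current maximum column value
-- without changing anything.
DBCC CHECKIDENT (N'dbo.YourTable', NORESEED);

-- Compare the last identity value generated with what is already stored.
SELECT IDENT_CURRENT(N'dbo.YourTable') AS current_identity,
       MAX(Id)                         AS max_id,
       MIN(Id)                         AS min_id
FROM dbo.YourTable;
```

If current_identity is well below max_id, new rows are landing in the middle of the existing key range.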

Is there any chance you are shrinking your database? Auto_shrink or a maint plan/job that does shrinks? If so it's the shrink that's causing your fragmentation not the clustered index.
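
The auto-shrink setting can be checked with a short catalog query like this one; scheduled shrink jobs or maintenance plans would have to be checked separately.

```sql
-- is_auto_shrink_on = 1 means AUTO_SHRINK is enabled for the database.
SELECT name, is_auto_shrink_on
FROM sys.databases
WHERE name = DB_NAME();   -- current database
```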

In general there is nothing wrong with a HEAP and they can be faster for INSERTs. My biggest concern with them tends to be if you are doing large numbers of deletes or updates (which you say you aren't). In those cases you can get a space leak and end up with a table that is multiple GBs in size but has 0 rows.

Actual answer

Given you have a log file, and are only ever inserting, you could try dropping the PK and see how performance goes (in a test environment first, of course). Once you've run some tests with your workload and seen how things go, make the change in production and monitor it there for a while. You might even consider dropping the identity column entirely.

Do check that SHRINK thing though. That's a killer.

Sql-server – Will a nonclustered index with a unique column always address all queries filtering that column first

Yes, given the constraints in the question, particularly that the primary key column is the leading column in the indexes. Also assuming the primary key never changes.
Not necessarily.

The optimizer can indeed infer uniqueness without marking the nonclustered index unique.

Marking the index unique may introduce a Split-Sort-Collapse combination in execution plans that change an index key. The extra Sort in particular has the potential to be performance-affecting.

On the other hand, not marking the index unique risks data integrity if the primary key is ever changed.

Example

CREATE TABLE dbo.Test
(
    pk integer PRIMARY KEY NONCLUSTERED,
    c1 integer NOT NULL,
    c2 integer NOT NULL
);

-- Not unique on pk, c1
CREATE NONCLUSTERED INDEX ic1 ON dbo.Test (pk, c1);

-- Unique on pk, c2
CREATE UNIQUE NONCLUSTERED INDEX ic2 ON dbo.Test (pk, c2);

Uniqueness

-- Neither plan has an aggregate
SELECT DISTINCT T.c1 FROM dbo.Test AS T WHERE T.pk = 1;
SELECT DISTINCT T.c2 FROM dbo.Test AS T WHERE T.pk = 1;

Split, Sort, Collapse

-- No split, sort, collapse
UPDATE dbo.Test SET c1 = CHECKSUM(NEWID());

-- Split, sort, collapse updating unique key
UPDATE dbo.Test SET c2 = CHECKSUM(NEWID());

Note the split-sort-collapse plan is also a wide (per-index) update.

Uniqueness is a huge topic though. I would normally mark something that is unique as unique, unless there is a good reason not to.

To anticipate comments about heap tables: Most tables benefit from being clustered. You need good reasons to choose a heap structure, especially from a space management point of view, if the table ever experiences deletes. Updates can also introduce performance impacts if columns expand beyond the space available on the original page (forwarded records).
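
As a rough sketch, forwarded records and page fullness in a heap can be inspected with sys.dm_db_index_physical_stats. The example uses dbo.Test from above, which is a heap because its primary key is nonclustered.

```sql
-- index_id 0 addresses the heap itself. forwarded_record_count is NULL in
-- LIMITED mode, so SAMPLED or DETAILED is needed to see it.
SELECT ips.index_type_desc,
       ips.alloc_unit_type_desc,
       ips.page_count,
       ips.record_count,
       ips.forwarded_record_count,
       ips.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID(N'dbo.Test'), 0, NULL, 'DETAILED') AS ips;
```

DETAILED mode reads every page, so on a large table it may be better to run this outside peak hours or use SAMPLED instead.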