Sql-server – Primary Key choice on table with unique identifier

identityprimary-keysql server

I'm reconsidering my choice of Primary Key on a table, and would like some input.

Some context: We are building a REST API, where clients access "Assets" identified by a GUID. So, the table for Asset in our database resembles this:

CREATE TABLE [dbo].[Asset](
    [Id] [uniqueidentifier] NOT NULL,
    [TypeId] [int] NOT NULL,
    [Date] [datetime] NOT NULL,
    [Title] [varchar](1000) NULL,
    [Description] [varchar](max) NULL,
    [Created] [datetime] NOT NULL,
    [Modified] [datetime] NOT NULL,
    [PublisherFields] [xml] NULL,
    [StatusId] [int] NOT NULL,
 CONSTRAINT [PK_Asset] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

We are experiencing a few problems with this structure, notably that the clustered index on the table becomes fragmented very quickly, and we have now reached the point where we are no longer able to rebuild it (because of Azure limitations on query time limits).

Further research suggests that using a uniqueidentifier as a clustered index is not always a good choice, because of the way that newid() doesn't generate IDs in sequence (unless you use newsequentialid(), but Azure doesn't allow this either).

The simple solution seems to be to introduce a dummy column and add the clustered index to that. Design guidelines say that good clustered indexes should be sequential and non-changing, so the simplest answer to that seems to be an Identity column. This would make my table into this:

CREATE TABLE [dbo].[Asset](
        [Id] [bigint] IDENTITY(1,1), <---- NEW
        [AssetId] [uniqueidentifier] NOT NULL, <---- Renamed
        [TypeId] [int] NOT NULL,
        [Date] [datetime] NOT NULL,
        [Title] [varchar](1000) NULL,
        [Description] [varchar](max) NULL,
        [Created] [datetime] NOT NULL,
        [Modified] [datetime] NOT NULL,
        [PublisherFields] [xml] NULL,
        [StatusId] [int] NOT NULL,
     CONSTRAINT [PK_Asset] PRIMARY KEY NONCLUSTERED <--- Now NonClustered
    (
        [Id] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
    )

The clustered index would also be added to the ID column. Does this seems a sensibible alternative? I might question the sense in having an IDENTITY column on the table that is not a primary key, but my reasoning is that an IDENTITY column is used to obtain the auto-incrementing nature, which suits the clustered index. It might also be confusing…when someone references an "AssetID", which column are they talking about? The [AssetId] column, or the Asset table's [ID] column?

Suggestions and alternatives welcomed.

Best Answer

There are a few things to consider here:

  1. How is data most often looked up in this table?
  2. How is data most often sorted in this table?
  3. Does this table relate to others as the parent record (i.e. will other tables FK to the PK of this table)?

Also, please keep in mind:

  1. The key fields of a Clustered Index are copied into the non-clustered indexes on the same table
  2. For Clustered Indexes that are not a Primary Key (implied unique) or at least declared as UNIQUE, a hidden "uniqueifier" field is added to rows that would otherwise be duplicates.
  3. NEWSEQUENTIALID() is sequential per each restart of the SQL Server service. It is possible that the starting value after a restart is less than the previous lowest value.

Hence:

  • If the PK field of this table shows up as a FK field in other tables, there is a definite performance implication of choosing to use a UNIQUEIDENTIFIER instead of an INT as the FKed tables would then have a larger FK field in them.
  • How many rows do you really expect to have in this table? INT is 4 bytes (compared to the 8 bytes of a BIGINT) and has a max value of 2,147,483,647. If you might have slightly over 2.14 billion items, you can also start the IDENTITY range at the min value of each datatype, which for INT is -2,147,483,648. Starting at the low-end gives you the full 4.294 billion values to use. Compared to the 8 bytes of the DATETIME field if going with the Created field, plus add in the size of another field to make it unique, or the uniqueifier for any duplicate rows.
  • Since the key field(s) of the Clustered Index are included in Nonclustered indexes, that increases the chances of having a covering index without needing to INCLUDE other columns. Meaning, if you have the Clustered Index on the INT PK and a Nonclustered index on the UNIQUEIDENTIFIER, then JOINing to another table on that INT PK field while specifying the GUID value in a WHERE clause (assuming no other fields from this table are in the query) won't have to go back to the table since the Nonclustered Index will have both of the requiered fields in it. Does the Created field give the same benefit? Likely not.

IF no other tables ever JOIN to this table:

  • THEN it might be ok to use the Created field as the non-unique Clustered Index and the Id field as the Nonclustered PK.
  • ELSE it is typically best to add an INT (unless you need more than 4.294 billion values) IDENTITY field, AssetId, as the Clustered PK, and the Id UNIQUEIDENTIFIER field as a Nonclustered Index. Since you likely already have code referencing the UNIQUEIDENTIFIER field as Id, I wouldn't change that name.