Sql-server – Primary Key choice on table with unique identifier

identityprimary-keysql server

I'm reconsidering my choice of Primary Key on a table, and would like some input.

Some context: We are building a REST API, where clients access "Assets" identified by a GUID. So, the table for Asset in our database resembles this:

CREATE TABLE [dbo].[Asset](
    [Id] [uniqueidentifier] NOT NULL,
    [TypeId] [int] NOT NULL,
    [Date] [datetime] NOT NULL,
    [Title] [varchar](1000) NULL,
    [Description] [varchar](max) NULL,
    [Created] [datetime] NOT NULL,
    [Modified] [datetime] NOT NULL,
    [PublisherFields] [xml] NULL,
    [StatusId] [int] NOT NULL,
 CONSTRAINT [PK_Asset] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

We are experiencing a few problems with this structure, notably that the clustered index on the table becomes fragmented very quickly, and we have now reached the point where we are no longer able to rebuild it (because of Azure limitations on query time limits).

Further research suggests that using a uniqueidentifier as a clustered index is not always a good choice, because of the way that newid() doesn't generate IDs in sequence (unless you use newsequentialid(), but Azure doesn't allow this either).

The simple solution seems to be to introduce a dummy column and add the clustered index to that. Design guidelines say that good clustered indexes should be sequential and non-changing, so the simplest answer to that seems to be an Identity column. This would make my table into this:

CREATE TABLE [dbo].[Asset](
        [Id] [bigint] IDENTITY(1,1), <---- NEW
        [AssetId] [uniqueidentifier] NOT NULL, <---- Renamed
        [TypeId] [int] NOT NULL,
        [Date] [datetime] NOT NULL,
        [Title] [varchar](1000) NULL,
        [Description] [varchar](max) NULL,
        [Created] [datetime] NOT NULL,
        [Modified] [datetime] NOT NULL,
        [PublisherFields] [xml] NULL,
        [StatusId] [int] NOT NULL,
     CONSTRAINT [PK_Asset] PRIMARY KEY NONCLUSTERED <--- Now NonClustered
    (
        [Id] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
    )

The clustered index would also be added to the ID column. Does this seems a sensibible alternative? I might question the sense in having an IDENTITY column on the table that is not a primary key, but my reasoning is that an IDENTITY column is used to obtain the auto-incrementing nature, which suits the clustered index. It might also be confusing…when someone references an "AssetID", which column are they talking about? The [AssetId] column, or the Asset table's [ID] column?

Suggestions and alternatives welcomed.

Best Answer

There are a few things to consider here:

How is data most often looked up in this table?
How is data most often sorted in this table?
Does this table relate to others as the parent record (i.e. will other tables FK to the PK of this table)?

Also, please keep in mind:

The key fields of a Clustered Index are copied into the non-clustered indexes on the same table
For Clustered Indexes that are not a Primary Key (implied unique) or at least declared as UNIQUE, a hidden "uniqueifier" field is added to rows that would otherwise be duplicates.
NEWSEQUENTIALID() is sequential per each restart of the SQL Server service. It is possible that the starting value after a restart is less than the previous lowest value.

Hence:

If the PK field of this table shows up as a FK field in other tables, there is a definite performance implication of choosing to use a UNIQUEIDENTIFIER instead of an INT as the FKed tables would then have a larger FK field in them.
How many rows do you really expect to have in this table? INT is 4 bytes (compared to the 8 bytes of a BIGINT) and has a max value of 2,147,483,647. If you might have slightly over 2.14 billion items, you can also start the IDENTITY range at the min value of each datatype, which for INT is -2,147,483,648. Starting at the low-end gives you the full 4.294 billion values to use. Compared to the 8 bytes of the DATETIME field if going with the Created field, plus add in the size of another field to make it unique, or the uniqueifier for any duplicate rows.
Since the key field(s) of the Clustered Index are included in Nonclustered indexes, that increases the chances of having a covering index without needing to INCLUDE other columns. Meaning, if you have the Clustered Index on the INT PK and a Nonclustered index on the UNIQUEIDENTIFIER, then JOINing to another table on that INT PK field while specifying the GUID value in a WHERE clause (assuming no other fields from this table are in the query) won't have to go back to the table since the Nonclustered Index will have both of the requiered fields in it. Does the Created field give the same benefit? Likely not.

IF no other tables ever JOIN to this table:

THEN it might be ok to use the Created field as the non-unique Clustered Index and the Id field as the Nonclustered PK.
ELSE it is typically best to add an INT (unless you need more than 4.294 billion values) IDENTITY field, AssetId, as the Clustered PK, and the Id UNIQUEIDENTIFIER field as a Nonclustered Index. Since you likely already have code referencing the UNIQUEIDENTIFIER field as Id, I wouldn't change that name.

Related Solutions

Sql-server – Indexing – Uniqueidentifier Foreign Key or Intermediary mapping table

Ok, I am making a lot of assumptions (INT instead of VARCHAR(50) being one of them) with this answer, so feel free to correct me if needed. The problem with option B is that it introduces a new join to relate Users to Alerts without any real added benefit. If joining on the UserID, it is best to index the UserID, so you can utilize seeks for your joins.

For Option A, UserID will be the clustering key (index key for the clustered index) on the Users table. UserID will be a nonclustered index key on Alerts table. This will cost 16 bytes per Alert.

For Option B, UserID will be the clustering key on the Users table. UserId will probably be the clustering key in UserMap too, to make joining more efficient. UserKey (assuming this is an INT) would then be a nonclustered index key on the Alerts table. This will cost 4 bytes per Alert. And 20 bytes per UserMap.

Looking at the big picture, one relationship, for Option A, costs 16 bytes of storage, and involves 1 join operation. Whereas, one relationship, for Option B, costs 24 bytes of storage, and involves 2 join operations.

Furthermore, there are a possibility of 340,282,366,920,938,000,000,000,000,000,000,000,000 uniqueidentifiers and only 4,294,967,296 INTs. Implementing a uniqueidentifier to INT map for a this type of relationship could cause unexpected results when you start reusing INTs.

The only reason for creating this type map table, is if you plan on creating a Many to Many relationship between Users and Alerts.

Taking all of this into consideration, I would recommend Option A.

I hope this helps,

Matt

Sql-server – Parent-Child Tree Hierarchical ORDER

OK, enough brain cells are dead.

SQL Fiddle

WITH cte AS
(
  SELECT 
    [ICFilterID], 
    [ParentID],
    [FilterDesc],
    [Active],
    CAST(0 AS varbinary(max)) AS Level
  FROM [dbo].[ICFilters]
  WHERE [ParentID] = 0
  UNION ALL
  SELECT 
    i.[ICFilterID], 
    i.[ParentID],
    i.[FilterDesc],
    i.[Active],  
    Level + CAST(i.[ICFilterID] AS varbinary(max)) AS Level
  FROM [dbo].[ICFilters] i
  INNER JOIN cte c
    ON c.[ICFilterID] = i.[ParentID]
)

SELECT 
  [ICFilterID], 
  [ParentID],
  [FilterDesc],
  [Active]
FROM cte
ORDER BY [Level];

Best Answer

Related Solutions

Sql-server – Indexing – Uniqueidentifier Foreign Key or Intermediary mapping table

Sql-server – Parent-Child Tree Hierarchical ORDER

Related Question