Sql-server – Performance of using a non-clustered index as primary key

index-tuningphysical-designsql serveruniqueidentifier

Background

I am designing a database for a record management system. It's early stages as I am considering what to use as a primary key: INT or UNIQUEIDENTIFIER.

The reason for considering UNIQUEIDENTIFIER is because there is a good chance we will want a feature that allows records to be moved to other systems (foreign database), and handling all relationships with GUIDs is going to make that so much easier.

Having done some research, there are plenty of debates on the pros and cons of each, the main one I am concerned with is performance. The outcome of my research is that using a UNIQUEIDENTIFIER as the PK will not be a problem as long as it isn't also used as the clustered index for the table.

Ok, so the obvious solution would be to split them into two separate columns:

TableA

ID: INT, Identity, Clustered Index 
PK: UNIQUEIDENTIFIER, Non-clustered Index

Question

My concern here is when I am defining relationships, how will this effect join performance? Take for example, this 'child' table:

TableB

ID: INT, Identity, Clustered Index 
PK: UNIQUEIDENTIFIER, Non-clustered Index
FK: UNIQUEIDENTIFIER, Non-clustered Index

Should I be concerned with any performance loss, particularly with joining tables, by using this approach?

For example, using a query such as:

SELECT * FROM TableB JOIN TableA ON TableA.PK = TableB.FK

Basically, is this the best design for supporting a GUID based PK, or is this going to lead to bad performance? Are there any significant disadvantages to using the approach I suggested?

Best Answer

Here is what I have seen based on several years of working with GUIDs as clustered, and nonclustered primary keys...there is no one correct answer here; the big thing really comes down to the access methods that you are going to use to get to the data, and the volume of data that you are going to return.

Yes, you will have the fragmentation problem with GUID clustered indexes that will have to be managed, however, if your access method uses that GUID as the query predicate, and you return multiple columns from that table, you may be better off suffering the fragmentation hit in order to improve the performance and lessen the impact of reading that data back out.

Here's a very basic example that show that there is less work required to get the data out when it's the clustered index, and the predicate used to get at the data:

/* NonClustered PK */
CREATE TABLE #T1
    (
      C1 INT NOT NULL ,
      C2 UNIQUEIDENTIFIER NOT NULL ,
      C3 VARCHAR(100) NULL ,
      C4 VARCHAR(20) NULL ,
      CONSTRAINT PK_T1 PRIMARY KEY NONCLUSTERED ( C2 )
    );
CREATE CLUSTERED INDEX T1_C1 ON #T1 (C1);

/* Clustered PK */
CREATE TABLE #T2
    (
      C1 INT NOT NULL ,
      C2 UNIQUEIDENTIFIER NOT NULL ,
      C3 VARCHAR(100) NULL ,
      C4 VARCHAR(20) NULL ,
      CONSTRAINT PK_T2 PRIMARY KEY CLUSTERED ( C2 )
    );



/* Insert 10 rows into each table */
INSERT INTO #T1
        ( C1, C2, C3, C4 )
VALUES  ( 0, '58BBB460-1AFA-4177-BA78-798DA19E0C97', 'some text', 'C4 text')
, ( 1, '17E8163B-BE21-44C7-A7B7-4997265A139D', 'some text', 'C4 text')
, ( 2, '16AACAB8-CD77-4A8D-BE87-9E433CD157EC', 'some text', 'C4 text')
, ( 3, '787D0714-F92A-4963-89E5-3F5DBF518EA7', 'some text', 'C4 text')
, ( 4, '5C720476-D4BE-4047-9F73-DBB1B6B75208', 'some text', 'C4 text')
, ( 5, 'D70F81C5-8AFF-4ABE-BA64-8F5C1A1C6A90', 'some text', 'C4 text')
, ( 6, '1473E5DC-6F3E-4164-988C-E36EE7C695BE', 'some text', 'C4 text')
, ( 7, '648AEA46-4B45-41F9-AA9B-7129062391B4', 'some text', 'C4 text')
, ( 8, '49497ECB-774D-482D-8230-218E97FB2EB4', 'some text', 'C4 text')
, ( 9, 'B90504FA-CEBA-4383-A61F-82F33DAB7A0E', 'some text', 'C4 text');

INSERT INTO #T2
        ( C1, C2, C3, C4 )
SELECT C1 ,
       C2 ,
       C3 ,
       C4 FROM #T1;

/* Index seek on T1 - no lookup as we include the clustering key */
SELECT C1, C2 FROM #T1 WHERE C2 = '648AEA46-4B45-41F9-AA9B-7129062391B4'

/* Index seek AND key lookup on T1 because there are columns not contained in the PK */
SELECT C1, C2, C3 FROM #T1 WHERE C2 = '648AEA46-4B45-41F9-AA9B-7129062391B4'

/* Clustered index seek on T2 as the predicate is the PK and clustered index */
SELECT C1, C2 FROM #T2 WHERE C2 = '648AEA46-4B45-41F9-AA9B-7129062391B4'

/* Stil a clustered index seek on T2 as the predicate is the PK and clustered index */
SELECT C1, C2, C3 FROM #T2 WHERE C2 = '648AEA46-4B45-41F9-AA9B-7129062391B4'

The things that you need to take into account, and think about are going to be:

  • Storage performance
  • Performance expectations around reads and writes
  • Access methods (columns returned, and columns used for the predicates)
  • Space available (bear in mind the clustering key is included in all nonclustered indexes, meaning that you will use more space, and waste more in the cache

Sadly there is no single correct answer for this, you can only get the appropriate expectations, and performing extensive testing to see if those expectations can be met with whatever data model you design, and if not, look to make adjustments to improve them.

Are there any significant disadvantages to using the approach I suggested?

It is very difficult to say. The only real disadvantage is that, if most of your queries come in via the nonclustered PK, and you are very read heavy you are going to see a great deal of increased IO as you will be performing key lookups on every call.

If your storage can withstand that, then it's great. And on the upside, you don't have the fragmentation problem which can lead to index maintenance hell over time (and can mean that you can't rebuild these things if you are running AGs and don't have maintenance periods).