Sql-server – Performance of using a non-clustered index as primary key

index-tuningphysical-designsql serveruniqueidentifier

Background

I am designing a database for a record management system. It's early stages as I am considering what to use as a primary key: INT or UNIQUEIDENTIFIER.

The reason for considering UNIQUEIDENTIFIER is because there is a good chance we will want a feature that allows records to be moved to other systems (foreign database), and handling all relationships with GUIDs is going to make that so much easier.

Having done some research, there are plenty of debates on the pros and cons of each, the main one I am concerned with is performance. The outcome of my research is that using a UNIQUEIDENTIFIER as the PK will not be a problem as long as it isn't also used as the clustered index for the table.

Ok, so the obvious solution would be to split them into two separate columns:

TableA

ID: INT, Identity, Clustered Index 
PK: UNIQUEIDENTIFIER, Non-clustered Index

Question

My concern here is when I am defining relationships, how will this effect join performance? Take for example, this 'child' table:

TableB

ID: INT, Identity, Clustered Index 
PK: UNIQUEIDENTIFIER, Non-clustered Index
FK: UNIQUEIDENTIFIER, Non-clustered Index

Should I be concerned with any performance loss, particularly with joining tables, by using this approach?

For example, using a query such as:

SELECT * FROM TableB JOIN TableA ON TableA.PK = TableB.FK

Basically, is this the best design for supporting a GUID based PK, or is this going to lead to bad performance? Are there any significant disadvantages to using the approach I suggested?

Best Answer

Here is what I have seen based on several years of working with GUIDs as clustered, and nonclustered primary keys...there is no one correct answer here; the big thing really comes down to the access methods that you are going to use to get to the data, and the volume of data that you are going to return.

Yes, you will have the fragmentation problem with GUID clustered indexes that will have to be managed, however, if your access method uses that GUID as the query predicate, and you return multiple columns from that table, you may be better off suffering the fragmentation hit in order to improve the performance and lessen the impact of reading that data back out.

Here's a very basic example that show that there is less work required to get the data out when it's the clustered index, and the predicate used to get at the data:

/* NonClustered PK */
CREATE TABLE #T1
    (
      C1 INT NOT NULL ,
      C2 UNIQUEIDENTIFIER NOT NULL ,
      C3 VARCHAR(100) NULL ,
      C4 VARCHAR(20) NULL ,
      CONSTRAINT PK_T1 PRIMARY KEY NONCLUSTERED ( C2 )
    );
CREATE CLUSTERED INDEX T1_C1 ON #T1 (C1);

/* Clustered PK */
CREATE TABLE #T2
    (
      C1 INT NOT NULL ,
      C2 UNIQUEIDENTIFIER NOT NULL ,
      C3 VARCHAR(100) NULL ,
      C4 VARCHAR(20) NULL ,
      CONSTRAINT PK_T2 PRIMARY KEY CLUSTERED ( C2 )
    );



/* Insert 10 rows into each table */
INSERT INTO #T1
        ( C1, C2, C3, C4 )
VALUES  ( 0, '58BBB460-1AFA-4177-BA78-798DA19E0C97', 'some text', 'C4 text')
, ( 1, '17E8163B-BE21-44C7-A7B7-4997265A139D', 'some text', 'C4 text')
, ( 2, '16AACAB8-CD77-4A8D-BE87-9E433CD157EC', 'some text', 'C4 text')
, ( 3, '787D0714-F92A-4963-89E5-3F5DBF518EA7', 'some text', 'C4 text')
, ( 4, '5C720476-D4BE-4047-9F73-DBB1B6B75208', 'some text', 'C4 text')
, ( 5, 'D70F81C5-8AFF-4ABE-BA64-8F5C1A1C6A90', 'some text', 'C4 text')
, ( 6, '1473E5DC-6F3E-4164-988C-E36EE7C695BE', 'some text', 'C4 text')
, ( 7, '648AEA46-4B45-41F9-AA9B-7129062391B4', 'some text', 'C4 text')
, ( 8, '49497ECB-774D-482D-8230-218E97FB2EB4', 'some text', 'C4 text')
, ( 9, 'B90504FA-CEBA-4383-A61F-82F33DAB7A0E', 'some text', 'C4 text');

INSERT INTO #T2
        ( C1, C2, C3, C4 )
SELECT C1 ,
       C2 ,
       C3 ,
       C4 FROM #T1;

/* Index seek on T1 - no lookup as we include the clustering key */
SELECT C1, C2 FROM #T1 WHERE C2 = '648AEA46-4B45-41F9-AA9B-7129062391B4'

/* Index seek AND key lookup on T1 because there are columns not contained in the PK */
SELECT C1, C2, C3 FROM #T1 WHERE C2 = '648AEA46-4B45-41F9-AA9B-7129062391B4'

/* Clustered index seek on T2 as the predicate is the PK and clustered index */
SELECT C1, C2 FROM #T2 WHERE C2 = '648AEA46-4B45-41F9-AA9B-7129062391B4'

/* Stil a clustered index seek on T2 as the predicate is the PK and clustered index */
SELECT C1, C2, C3 FROM #T2 WHERE C2 = '648AEA46-4B45-41F9-AA9B-7129062391B4'

The things that you need to take into account, and think about are going to be:

Storage performance
Performance expectations around reads and writes
Access methods (columns returned, and columns used for the predicates)
Space available (bear in mind the clustering key is included in all nonclustered indexes, meaning that you will use more space, and waste more in the cache

Sadly there is no single correct answer for this, you can only get the appropriate expectations, and performing extensive testing to see if those expectations can be met with whatever data model you design, and if not, look to make adjustments to improve them.

Are there any significant disadvantages to using the approach I suggested?

It is very difficult to say. The only real disadvantage is that, if most of your queries come in via the nonclustered PK, and you are very read heavy you are going to see a great deal of increased IO as you will be performing key lookups on every call.

If your storage can withstand that, then it's great. And on the upside, you don't have the fragmentation problem which can lead to index maintenance hell over time (and can mean that you can't rebuild these things if you are running AGs and don't have maintenance periods).

Related Solutions

Sql-server – Excluding clustered key columns from non-clustered indexes definitions

On the SQL Saturday event I've attended, one of the lecturers said it's a always good practice to exclude the clustered key columns from the non-clustered index key definition and include clause, too.

I disagree. Let me explain with the table in your question:

The clustered index is: (RecordID, QuestionID) and there are many more columns.

Any non-clustered index will also have the clustered key columns, too, appended in the end. So, an index like:

(Pts)   is equivalent to:   (Pts, RecordID, QuestionID)

and similarly:

(Pts, PtsOf)           <->:  (Pts, PtsOf, RecordID, QuestionID)

(Pts, RecordId)        <->:  (Pts, RecordID, QuestionID)

(Pts, QuestionID)      <->:  (Pts, QuestionID, RecordID)

(Pts) INCLUDE (PtsOf)  <->:  (Pts, RecordID, QuestionID) INCLUDE (PtsOf) 

(QuestionID, RecordID) <->:  (QuestionID, RecordID)

For joint tables or like this one that have a composite primary/unique key - whether it is clustered or not, it's very often to have queries that need the (a,b) index and others that will use better the (b,a) index, sometimes queries that need both. So, one often needs both of these.

If the composite clustered key has more than two columns - say (a,b,c) - it's often that you may need an index on (b,c,a) or in (d,b) and another on (e,c,a) (which of course will be equivalent to (b,c,a), (d,b,a,c) and (e,c,a,b) respectively.) You can't just remove these columns from the definitions of the non-clustered keys because the column order will change.

The suggestion has one good point though. The clustered key columns can be removed from the INCLUDE part. They are redundant, just noise there.

About the indexes in the question, a non-CI on (Pts, PtsOf) is equivalent to (Pts, PtsOf, RecordID, QuestionID), so it is very different than the original non-CI on (RecordID, QuestionID) INCLUDE (Pts, PtsOf). It will use a bit more space than the original and of course these two indexes will be useful for different types of queries.

The (Pts, PtsOf) will be be useful, for example, for queries with WHERE Pts BETWEEN 0 AND 100, WHERE Pts = 200 AND PtsOf = 300, etc.
The (RecordID, QuestionID) INCLUDE (Pts, PtsOf) is basically a copy of the table with only the 2 clustered key columns and 2 only extra columns (of the many). This is (rarely) useful and it's a form of vertical partitioning. If you often have queries that need all the rows of the table but only these 2 columns, then it's probably one these (rather rare) cases where the extra space and effort to maintain this index is justified.

Sql-server – SQL Server 2008: Unused Clustered Index

But you have an index on TweetID. Is that index not used by the joins?

Even a non-clustered index will fragment with insert and update. As that index fragments inserts take longer. How are you managing fill factor and index maintenance on this non-clustered index?

An unused PK does nothing.

TweetID as PK will save some space. What % of rows get added daily. If only 1% rows get added daily you could use a fill factor of 90% and defrag daily or weekly. Start with like 80% or 90% and see how fast it fragments.

If the TweetID are out of order but in the same range then the problem you have there is fill factor gets hammered in that range and unused in the rest of the range. But a defag is faster as it is only cleaning up that part of the range. In that case you may be better off with 100% fill factor and a regular index reorganize.

There is no cut and dry answer. You need to manage and monitor the index for your data and your workload.

Best Answer

Related Solutions

Sql-server – Excluding clustered key columns from non-clustered indexes definitions

Sql-server – SQL Server 2008: Unused Clustered Index

Related Question