Sql-server – Does Clustered Index on GUID create more fragmentation than Non Clustered Index

clustered-indexfragmentationidentitysql serversql server 2014

I have a simple table:

CREATE TABLE [dbo].[UserTestGroups](
    [UserTestGroupId] [int] IDENTITY(1,1) NOT NULL,
    [Token] [nvarchar](100) NOT NULL,
    [TestId] [bigint] NOT NULL,
    [Group] [tinyint] NOT NULL,
    [InsertDate] [datetime] NOT NULL)

The table will have a relatively large number of inserts – initially 10,000 sessions with 8 rows per session for a total of 80,000 daily inserts. We expect that to increase significantly in the near future. Regardless, we try to make our platform as resilient as possible, and not just to be OK with current load.

The table will be used for reporting, which can be considered a secondary requirement.

Since we are using Entity Framework (Microsoft's ORM) all tables that are written to need a logical primary key, so I added an Identity column that I simply ignore. I don't like having composite keys, they tend to make nightmare with ORMs, so unless I absolutely have to, I add another identity column and make it the PK.

All the BI report's queries will be based on the Token, which is a sessionId – the GUID, so I created the clustered index on that token column.

Our DBA, who is a SQL Server MVP, told me that having the clustered index on a GUID column will cause fragmentation and that I should create the clustered index on the IDENTITY column, and create a nonclustered index for the Token column.

I don't understand, won't the nonclustered index have the same fragmentation problem? Why would duplicating the data into a new index be better than using a clustered index?

The token-Guid isn't the PK, the PK is a composition of all the columns together (including the InsertDate).

Currently (for I really don't know what reason) all GUID values are stored as NVARCHAR(100), I don't know what's the historical reason-justification for that, but that is what we have.

Our DBA is overseas and our communication is a tweet long, so I can't get a proper answer until he comes back

Best Answer

Since you are indicating insert performance is the primary concern, I'd take the recommendation of the DBA and make the clustering key the identity column since it is a unique, monotonically ascending number, which is guaranteed to (almost) never cause page-splits on the table.

Also, don't store the GUID in an nvarchar(100) column, use the data type designed for them, which is uniqueidentifier.

Look carefully at the reporting requirements prior to adding indexes to this table. You may find out you don't need an index on the GUID column at all. If you determine that having an index on the GUID column is required, you may find it desirable to have the leading column of the index not be the GUID, perhaps it will be the Group or TestID column instead?

If you've determined, through a rigorous design process, that the majority of reporting queries against this table will be made using the session id as the primary component of joins or where clauses, you may actually benefit from the table being clustered on the session id. You'd likely want to set the page fill factor at some point below 100%, perhaps start at 93% and monitor fragmentation levels. Also, it's worth pointing out that fragmentation may not be such a massive problem if you store this table on SSD. Will the session id be propagated throughout the other related tables? If so, you'll likely need non-clustered indexes on the column in those other tables as well, compounding the fragmentation issue. It may be worthwhile maintaining a single table where you insert the GUIDs to obtain a surrogate integer key to use in the other tables, something like:

IF OBJECT_ID('dbo.SessionSurrogate') IS NOT NULL
DROP TABLE dbo.SessionSurrogate;
CREATE TABLE dbo.SessionSurrogate
(
    SessionID int NOT NULL
        CONSTRAINT PK_SessionSurrogate
        PRIMARY KEY NONCLUSTERED
        IDENTITY(1,1)
    , SessionGUID uniqueidentifier NOT NULL
);
CREATE UNIQUE INDEX CI_SessionSurrogate --could be non-unique if required
ON dbo.SessionSurrogate(SessionGUID);

IF OBJECT_ID('dbo.UserTestGroups') IS NOT NULL
DROP TABLE dbo.UserTestGroups;
CREATE TABLE dbo.UserTestGroups
(
    UserTestGroupsID int NOT NULL
        CONSTRAINT PK_UserTestGroups
        PRIMARY KEY CLUSTERED
        IDENTITY(1,1)
    , SessionID int NOT NULL
    , TestID bigint NOT NULL
    , [Group] tinyint NOT NULL
    , InsertDate datetime NOT NULL
);

GO
CREATE PROCEDURE dbo.InsertSession
(
    @SessionGUID uniqueidentifier
    , @TestID bigint
    , @Group tinyint
)
AS
BEGIN
    INSERT INTO dbo.SessionSurrogate (SessionGUID)
    OUTPUT INSERTED.SessionID, @TestID, @Group, GETDATE()
    INTO dbo.UserTestGroups (SessionID, TestID, [Group], InsertDate)
    OUTPUT INSERTED.SessionID
    VALUES (@SessionGUID);
END;
GO

You'd then insert data like:

DECLARE @SessGUID uniqueidentifier = NEWID();
DECLARE @TestID bigint = 42;
DECLARE @Group tinyint = 6;
EXEC dbo.InsertSession @SessGUID, @TestID, @Group;

The benefit here is you're only causing fragmentation on a single table with a very narrow row width (20 bytes), offering somewhere around 400 rows per page. The stored procedure above is very compact and outputs the generated int for you to use later, if necessary.