I have a simple table:
CREATE TABLE [dbo].[UserTestGroups](
[UserTestGroupId] [int] IDENTITY(1,1) NOT NULL,
[Token] [nvarchar](100) NOT NULL,
[TestId] [bigint] NOT NULL,
[Group] [tinyint] NOT NULL,
[InsertDate] [datetime] NOT NULL)
The table will have a relatively large number of inserts – initially 10,000 sessions with 8 rows per session for a total of 80,000 daily inserts. We expect that to increase significantly in the near future. Regardless, we try to make our platform as resilient as possible, and not just to be OK with current load.
The table will be used for reporting, which can be considered a secondary requirement.
Since we are using Entity Framework (Microsoft's ORM) all tables that are written to need a logical primary key, so I added an Identity column that I simply ignore. I don't like having composite keys, they tend to make nightmare with ORMs, so unless I absolutely have to, I add another identity column and make it the PK.
All the BI report's queries will be based on the Token, which is a sessionId – the GUID, so I created the clustered index on that token
column.
Our DBA, who is a SQL Server MVP, told me that having the clustered index on a GUID column will cause fragmentation and that I should create the clustered index on the IDENTITY
column, and create a nonclustered index for the Token
column.
I don't understand, won't the nonclustered index have the same fragmentation problem? Why would duplicating the data into a new index be better than using a clustered index?
The token-Guid isn't the PK, the PK is a composition of all the columns together (including the InsertDate
).
Currently (for I really don't know what reason) all GUID values are stored as NVARCHAR(100)
, I don't know what's the historical reason-justification for that, but that is what we have.
Our DBA is overseas and our communication is a tweet long, so I can't get a proper answer until he comes back
Best Answer
Since you are indicating insert performance is the primary concern, I'd take the recommendation of the DBA and make the clustering key the identity column since it is a unique, monotonically ascending number, which is guaranteed to (almost) never cause page-splits on the table.
Also, don't store the GUID in an
nvarchar(100)
column, use the data type designed for them, which isuniqueidentifier
.Look carefully at the reporting requirements prior to adding indexes to this table. You may find out you don't need an index on the GUID column at all. If you determine that having an index on the GUID column is required, you may find it desirable to have the leading column of the index not be the GUID, perhaps it will be the
Group
orTestID
column instead?If you've determined, through a rigorous design process, that the majority of reporting queries against this table will be made using the session id as the primary component of joins or where clauses, you may actually benefit from the table being clustered on the session id. You'd likely want to set the page fill factor at some point below 100%, perhaps start at 93% and monitor fragmentation levels. Also, it's worth pointing out that fragmentation may not be such a massive problem if you store this table on SSD. Will the session id be propagated throughout the other related tables? If so, you'll likely need non-clustered indexes on the column in those other tables as well, compounding the fragmentation issue. It may be worthwhile maintaining a single table where you insert the GUIDs to obtain a surrogate integer key to use in the other tables, something like:
You'd then insert data like:
The benefit here is you're only causing fragmentation on a single table with a very narrow row width (20 bytes), offering somewhere around 400 rows per page. The stored procedure above is very compact and outputs the generated
int
for you to use later, if necessary.