Performance Concerns with UniqueIdentifiers (GUIDs)

database-design sql-server

I am a software developer/architect at my company and occasionally our DBA whenever our primary DBA quits, so I think I kind of know what I'm doing, but this has me stumped as to the best course of action.

We are working on moving a monolithic system into microservices. As part of this endeavor we've decided to use event streaming via Kafka. The problem we are facing is that we need to be able to come up with identifiers for any entity we create before we can send anything to Kafka (Kafka uses that identifier as the message key, so all events for an entity land on the same partition and stay ordered). These identifiers will need to be created on multiple machines that host our client-facing applications (websites and APIs). So using a uniqueidentifier / GUID seems like a logical choice here.

If we were building this from scratch I wouldn't worry about this so much yet because we could separate various kinds of entities into their own data stores (users could be in one store / database, teams could be in another, etc.) which would keep the number of rows in a given table fairly well bounded and the querying load could be spread across multiple servers. However, our current system was poorly designed and every entity in our system lives in one large, very denormalized table (so any given row could represent a user, a team, a service, etc.). This table has somewhere between 8 million and 9 million rows in it. We also have to keep this table updated with any new data during our rewrite (which includes new GUIDs). And that's what has me worried.

We need to look these entities up by these identifiers and keep queries fast. And I know GUIDs make for horrible keys (for clustering or indexes) because they lead to huge fragmentation, page splitting, etc. We have a part of our existing system that uses GUIDs as a lookup for an entity. With a few hundred thousand rows the indexes fragmented to the point of uselessness and queries were relegated to table scans. We eventually were able to change the code to use sequential unique identifiers, but that isn't an option here. That makes me worried that using random GUIDs as identifiers is going to become a performance nightmare.
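For anyone wondering why we can't just switch back to sequential GUIDs here: SQL Server's NEWSEQUENTIALID() is only valid as a column DEFAULT, so the values are generated server-side, which is too late for our flow since the identifiers have to exist on the client machines before anything goes to Kafka. A sketch of that server-side-only approach (table and constraint names are illustrative):

    CREATE TABLE dbo.SequentialOnly
        (
        ID UNIQUEIDENTIFIER NOT NULL
            CONSTRAINT DF_SequentialOnly_ID DEFAULT NEWSEQUENTIALID()  -- only legal as a column DEFAULT
        , CONSTRAINT PK_SequentialOnly PRIMARY KEY CLUSTERED (ID)
        );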

So the question is: is there a way I can use a random GUID as a non-clustered key on a large-ish table in such a way that keeps lookups via those GUIDs fast? What kinds of things can I do to mitigate the problem? I'm also open to alternative solutions here because we haven't reached a point of no return.

Best Answer

I would do something like the example below. Keep an integer (or a bigint, if you need the range) IDENTITY column as the clustering key. This keeps the table neatly ordered, with the most recently added rows at the end, preventing fragmentation of the largest part of the table (the clustered index).

Then create a unique (if you need uniqueness enforced) non-clustered index on ExternalID (the random GUID), with a lower fill factor, e.g. 70, which leaves free space on each page to absorb the random inserts and minimizes fragmentation between reorgs.

This should make lookups on ExternalID fairly fast (it's a narrow index) and keep that index quick to reorganize. You can tune the fill factor as needed to make room for your daily insert load.
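To keep the fill factor honest, it helps to check the index's fragmentation periodically and reorganize when it creeps up. A maintenance sketch (the index and table names match the example below; the threshold is a rule of thumb, not a hard rule):

    -- Check fragmentation of the non-clustered GUID index
    SELECT ips.avg_fragmentation_in_percent, ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.LargishTable'), NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
        ON i.object_id = ips.object_id
        AND i.index_id = ips.index_id
    WHERE i.name = N'IDXUQ_LargishTable_ExternalID';

    -- Reorganize once fragmentation climbs past roughly 10-30%
    ALTER INDEX IDXUQ_LargishTable_ExternalID ON dbo.LargishTable REORGANIZE;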

EDIT (advice): Keep an eye on the execution plans for this. If you ask for a small number of ExternalIDs at once it should be fine, but asking for lots of them at once may cause SQL Server to decide that a table scan is the best plan, which it probably won't be.

CREATE TABLE dbo.LargishTable
    (
    ID INT NOT NULL IDENTITY(1,1)            -- surrogate key; use BIGINT if INT might run out
    , ExternalID UNIQUEIDENTIFIER NOT NULL   -- the client-generated random GUID
    , OtherColumns NVARCHAR(MAX) NULL
    -- Clustering on the ever-increasing identity keeps inserts at the end of the table
    , CONSTRAINT PK_LargishTable PRIMARY KEY CLUSTERED (ID)
    );

GO

-- Narrow, unique non-clustered index for lookups by GUID.
-- FILLFACTOR = 70 leaves ~30% free space per page to absorb
-- the random inserts between reorganizations.
CREATE UNIQUE NONCLUSTERED INDEX IDXUQ_LargishTable_ExternalID
    ON dbo.LargishTable (ExternalID)
    WITH (FILLFACTOR = 70, SORT_IN_TEMPDB = ON);

GO
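To see the index in action: a single-GUID lookup like the one below should show an index seek on IDXUQ_LargishTable_ExternalID followed by a key lookup into the clustered index in the actual execution plan. For many GUIDs at once, joining against a table of values tends to behave better than a long IN list, though you should verify the plan either way (the variable and temp-table names here are illustrative):

    -- Single lookup: expect an index seek plus a key lookup
    DECLARE @ExternalID UNIQUEIDENTIFIER = NEWID();  -- stand-in for a client-supplied GUID

    SELECT ID, OtherColumns
    FROM dbo.LargishTable
    WHERE ExternalID = @ExternalID;

    -- Many lookups at once: join against a table of GUIDs
    CREATE TABLE #Wanted (ExternalID UNIQUEIDENTIFIER PRIMARY KEY);
    -- populate #Wanted from the application, then:

    SELECT lt.ID, lt.OtherColumns
    FROM #Wanted AS w
    JOIN dbo.LargishTable AS lt
        ON lt.ExternalID = w.ExternalID;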