SQL Server – Sequential GUID or bigint for ‘huge’ database table PK

primary-key, sql-server, uniqueidentifier

I know this type of question comes up a lot, but I've yet to read any compelling arguments to help me make this decision. Please bear with me!

I have a huge database – it grows by about 10,000,000 records per day. The data is relational, and for performance reasons I load the table with BULK COPY. For this reason, I need to generate keys for the rows, and cannot rely on an IDENTITY column.

A 64-bit integer – a bigint – is wide enough for me to use, but in order to guarantee uniqueness, I need a centralised generator to make my IDs for me. I currently have such a generator service which allows a service to reserve X sequence numbers and guarantees no collisions. However, a consequence of this is that all the services I have are reliant on this one centralised generator, and so I'm limited in how I can distribute my system and am not happy about the other dependencies (such as requiring network access) imposed by this design. This has been a problem on occasion.

I'm now considering using sequential GUIDs as my primary keys (generated externally to SQL). As far as I've been able to ascertain from my own testing, the only drawback to these is the disk space overhead of a wider data type (which is exacerbated by their use in indexes). I've not witnessed any discernible slowdown in query performance, compared to the bigint alternative. Loading the table with BULK COPY is slightly slower, but not by much. My GUID-based indexes are not becoming fragmented thanks to my sequential GUID implementation.
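For illustration, here is a minimal sketch of one common way to generate such sequential GUIDs outside SQL Server: a COMB-style GUID that puts a millisecond timestamp in the trailing six bytes, because SQL Server orders uniqueidentifier values by those bytes first. This is a sketch of the general technique, not necessarily my exact generator:

```python
import os
import time
import uuid

def sequential_guid() -> uuid.UUID:
    """COMB-style GUID: 10 random bytes followed by a 48-bit millisecond
    timestamp. SQL Server sorts uniqueidentifier values by the trailing
    6 bytes first, so new values always sort after older ones and inserts
    land at the end of the index. (RFC 4122 version bits are not set here.)"""
    random_part = os.urandom(10)
    millis = int(time.time() * 1000) & 0xFFFFFFFFFFFF
    return uuid.UUID(bytes=random_part + millis.to_bytes(6, "big"))
```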

Basically, what I want to know is if there are any other considerations I may have overlooked. At the moment, I'm inclined to take the leap and start using GUIDs. I'm by no means a database expert, so I'd really appreciate any guidance.

Best Answer

I'm in a similar situation. Currently, I'm using the sequential GUID approach, and it gives me no fragmentation and easy key generation.

I have noticed two disadvantages that caused me to start migrating to bigint:

  1. Space usage. A GUID costs 8 more bytes than a bigint in every row of every index. Multiply that by 10 indexes or so and you waste a huge amount of space.
  2. Columnstore indexes do not support GUIDs.

(2) was the killer for me.

I will now generate my keys like this:

yyMMddHH1234567890

I'll be using a leading date-plus-hour prefix with a sequential part after it. That allows me to range-query my data by date without any additional index at all. This is a nice bonus for me.
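As a sketch of how such a key can be composed (Python here just for illustration; the ten-digit sequence budget matches the example key above, and taking the date in UTC is my choice, not a requirement):

```python
from datetime import datetime, timezone

SEQ_DIGITS = 10  # ten-digit sequential part, matching the example key above

def compose_key(seq: int, when: datetime) -> int:
    """Build a bigint of the form yyMMddHH followed by a zero-padded
    sequential part, e.g. seq=42 at 2024-06-15 12:xx -> 240615120000000042."""
    if not 0 <= seq < 10 ** SEQ_DIGITS:
        raise ValueError("sequence part overflows its digit budget")
    prefix = int(when.strftime("%y%m%d%H"))   # yyMMddHH as an integer
    key = prefix * 10 ** SEQ_DIGITS + seq
    assert key < 2 ** 63                      # always fits in a signed 64-bit bigint
    return key

# The range-query bonus: all keys for a given hour fall in one contiguous interval.
hour = datetime(2024, 6, 15, 12, tzinfo=timezone.utc)
start = compose_key(0, hour)
end = compose_key(10 ** SEQ_DIGITS - 1, hour)
# ... WHERE Id BETWEEN {start} AND {end}
```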

I'll generate the sequential part of the bigint using a HiLo algorithm, which lends itself well to distributed key generation.
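Roughly, the idea looks like this (a generic HiLo sketch, not my exact code; `fetch_next_hi` stands in for whatever atomic central counter you use, such as a SQL Server SEQUENCE, and the block size is arbitrary):

```python
import threading

class HiLoGenerator:
    """Generic HiLo sketch: one round trip to the central allocator per
    block of `block_size` ids; every other id is generated locally."""

    def __init__(self, fetch_next_hi, block_size=10_000):
        self._fetch_next_hi = fetch_next_hi  # callback to an atomic central counter
        self._block_size = block_size
        self._hi = 0
        self._lo = block_size                # force a block fetch on first use
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            if self._lo >= self._block_size:
                self._hi = self._fetch_next_hi()   # only network call, once per block
                self._lo = 0
            value = self._hi * self._block_size + self._lo
            self._lo += 1
            return value
```

The value it returns can then be used as the sequential part behind the yyMMddHH prefix, so each service only touches the central allocator when it exhausts a block.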

Hope some of this transfers to your situation. I definitely recommend using bigint.