SQL Server 2008 R2 – Gigantic Gaps in IDENTITY Column

identity, sql-server, sql-server-2008-r2

This question is about strange behavior in SQL Server 2008 R2 SP3 Standard Edition x64.

I have read many questions about IDENTITY increments in SQL Server and strange gaps between them. But basically people say it is related to transactions or restarts. In my case (using a named instance) gaps are simply gigantic.

The application inserted 130 000 records into the table, and an "Arithmetic overflow error converting IDENTITY to data type int" error occurred. We truncated the table and bulk inserted 70 000 records. During the bulk insert the maximum value of the IDENTITY column climbed above 9 000 000!

As it was not a normal load but a mass insert, there was no rollback, restart, or any other issue during this operation.

So the gaps are not on the order of 1 000 but tens of thousands.

As I do not control the application code (it is commercial software bought from another company), what can I do as the administrator of such a database? Do I have to truncate the table every 6 months? Monitor the ID column in real time?

Is anyone else experiencing a similar issue?

Best Answer

It is very much normal for there to be gaps in identity values (caused by rolled-back transactions, over-estimated pre-allocations, and so on), as you are already aware. That sort of sparse use of values is unusual, though, and indicates something is going on that you need to understand: either a design issue that may be unavoidable and that you just need to plan around, or a fault that needs fixing.

The first thing I would ask is: are you touching the IDENTITY column at all in the bulk insert? If you are working with IDENTITY_INSERT ON (or similar circumstances) and the imported data contains large values, then that is the source of the problem. Another thing to check for is hidden processes that might touch the table: do you have any triggers that fire as data is added to the table and do something odd?
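To rule those two things out, a quick diagnostic might look like this (a sketch only — `dbo.YourTable` is a placeholder for the real table name):

```sql
-- Report the current identity value without changing it
DBCC CHECKIDENT ('dbo.YourTable', NORESEED);

-- List any triggers defined on the table, with their definitions
SELECT t.name AS trigger_name,
       t.is_disabled,
       OBJECT_DEFINITION(t.object_id) AS trigger_body
FROM sys.triggers AS t
WHERE t.parent_id = OBJECT_ID('dbo.YourTable');

-- Compare the last identity value issued with the actual row count;
-- a huge difference confirms the sparse use of values
SELECT IDENT_CURRENT('dbo.YourTable') AS last_identity,
       COUNT(*) AS row_count
FROM dbo.YourTable;
```

If `last_identity` is far ahead of `row_count` immediately after a clean bulk load, the values are being burned during the insert itself rather than by later activity.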

Otherwise, if you could provide more detail in your question that might help someone help you:

  • The structure of the table in question (and any others touched by the same bulk process): column/key/index definitions and any triggers
  • The code you are using for the bulk insert (T-SQL, a call to BCP, whatever method you are using)
  • A sample of the data (perhaps anonymised if it is sensitive)
  • When you performed your test was the database otherwise quiet or was it active with other transactions potentially being processed?
  • Are the gaps following any sort of regular pattern, or do they seem random?
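On that last point, one way to see the gap pattern is to compare each identity value with its predecessor. SQL Server 2008 R2 predates LAG, so a ROW_NUMBER self-join does the job (again a sketch — `id` and `dbo.YourTable` are placeholder names):

```sql
-- Gap sizes between consecutive identity values; 2008 R2 has no LAG,
-- so number the rows and join each row to the previous one
WITH numbered AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM dbo.YourTable
)
SELECT prev.id              AS gap_starts_after,
       cur.id               AS next_id,
       cur.id - prev.id - 1 AS gap_size
FROM numbered AS cur
JOIN numbered AS prev
  ON prev.rn = cur.rn - 1
WHERE cur.id - prev.id > 1
ORDER BY gap_size DESC;
```

A few recurring gap sizes (say, always multiples of some batch size) would point at the bulk process itself; random sizes would point elsewhere.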

As a workaround you might consider using BIGINT for the column instead, though if it is referenced by foreign keys elsewhere that could be a big job to change (and if the value isn't purely internal to the database, you need to consider whether the code touching it can cope with values beyond 32-bit limits).
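For the simplest case, the change is a sketch like the following, assuming the IDENTITY column `id` is the primary key with no foreign keys pointing at it (all names are placeholders; if other tables reference the column, their foreign keys must be dropped and recreated too, and note this rewrites every row, so expect downtime on a large table):

```sql
-- The PK constraint must be dropped before the type can change
ALTER TABLE dbo.YourTable DROP CONSTRAINT PK_YourTable;

-- Widen the column; the IDENTITY property is preserved
ALTER TABLE dbo.YourTable ALTER COLUMN id BIGINT NOT NULL;

-- Recreate the primary key on the widened column
ALTER TABLE dbo.YourTable ADD CONSTRAINT PK_YourTable PRIMARY KEY (id);
```

Even at the rate you describe (roughly 9 000 000 values burned per 70 000 rows), a BIGINT gives you headroom for a very long time, which buys you space to chase the real cause without emergency truncations.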