SQL Server – Reasons for avoiding large ID values

auto-increment, sql-server

We are working on a web application that is not yet accessible to users. My boss noticed that newly created records get an ID of over 10 000, even though we only have under 100 records in the table. She assumed that the web interface for some reason creates over 100 times more temporary records than actual ones (and deletes them), and that this could cause us to run out of the ID range within a few months of release.

I don't think she is correct about the cause of the ID inflation (the colleague who can answer this is on vacation, so we don't know for sure), but let's assume that she is. She said that she'd hate to use a bigint column, and that she'd like us to stop autoincrementing the ID column and write server-side code which chooses the first "unused" integer and uses it as an ID.

I am a computer science grad student with little practical experience, filling a junior developer role. She has years of experience managing all of our organisation's databases, and she designed most of them. I think that she's incorrect in this case, that a bigint ID is nothing to be afraid of, and that mimicking the DBMS functionality smells of an antipattern. But I don't trust my judgment yet.

What are the arguments for and against each position? What bad things can happen if we use a bigint, and what are the dangers of reinventing the DBMS's autoincrementing functionality? Is there a third solution which is better than either one? What could her reasons be for wanting to avoid an inflation of ID face values? I'm interested in hearing about pragmatic reasons too – maybe bigint IDs work in theory but cause headaches in practice?

The application is not expected to handle very large amounts of data. I doubt that it will reach 10 000 actual records within the next few years.

If it makes any difference, we are using Microsoft SQL Server. The application is written in C# and uses LINQ to SQL.

Update

Thank you, I found the existing answers and comments interesting. But I'm afraid you misunderstood my question, so the answers don't contain what I wanted to know.

I'm not really concerned about the real reason for the high IDs. If we can't find it on our own, I could ask a different question. What I'm interested in is the decision process in this case. For this, please assume that the application will be writing 1000 records per day and then deleting 999 of them. I'm almost sure this is not the case, but this is what my boss believed when she made her request. So, under these hypothetical circumstances, what would be the pros and cons of either using bigint or writing our own code to assign IDs (in a way which reuses the IDs of already deleted records, to ensure there are no gaps)?

As for the actual reason, I strongly suspect that this is because we once wrote code to import data from another database, as a proof of concept that a later migration could be done to a certain extent. I think my colleague actually created several thousand records during the import and later deleted them. I have to confirm whether this was actually the case, but if it is, there is no need for action at all.

Best Answer

Without seeing code, it is pretty hard to say conclusively what is happening. Most likely, though, the IDENTITY value is being cached, causing gaps in the sequence after SQL Server is restarted. See https://stackoverflow.com/questions/17587094/identity-column-value-suddenly-jumps-to-1001-in-sql-server for some good answers and information about that.
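If identity caching does turn out to be the culprit, the gaps themselves are harmless, but the caching can also be disabled outright. A minimal sketch, assuming SQL Server 2017 or later (earlier versions relied on trace flag 272 instead):

ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;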

A simple INT field can hold values up to 2,147,483,647. You can actually start the identity value at -2,147,483,648, giving a full 32 bits of values: 4 billion distinct values. I doubt very much that you're going to run out. Assuming your application consumes 1,000 values for each actual row added, you'd need to create nearly 12,000 rows per day, every day, to run out of IDs in 6 months, assuming you started the IDENTITY value at 0 and were using an INT. If you were using a BIGINT, you would have to wait 21 million centuries before running out of values while writing 12,000 rows per day and consuming 1,000 "values" per row.
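To make those figures concrete, here is a quick back-of-the-envelope check you can run yourself; the consumption rates are the hypothetical ones from the question, not measurements:

DECLARE @ValuesPerDay BIGINT = 12000 * 1000;  -- 12,000 rows/day, each consuming 1,000 identity values

SELECT 2147483647 / @ValuesPerDay AS DaysToExhaustInt,               -- 178 days by integer division, about 6 months
       9223372036854775807 / @ValuesPerDay AS DaysToExhaustBigint;   -- ~768.6 billion days, roughly 21 million centuries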

Having said all that, if you want to use BIGINT as the identity field data type, there is certainly nothing wrong with that. It will give you, for all intents and purposes, a limitless supply of values. The performance difference between an INT and a BIGINT is practically non-existent on modern 64-bit hardware, and either is highly preferable to, for instance, using NEWID() to generate GUIDs.

If you wanted to manage your own values for the ID column, you could create a key table, and provide a pretty bulletproof way of doing that using one of the methods shown in the answers on this question: Handling concurrent access to a key table without deadlocks in SQL Server
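As a rough illustration of the idea (the table and names below are hypothetical, not taken from the linked question), the core of the key-table pattern is a single atomic UPDATE that increments and captures the counter in one statement, so two concurrent callers can never read the same value:

CREATE TABLE dbo.KeyTable
(
    TableName SYSNAME NOT NULL PRIMARY KEY,
    LastID    INT     NOT NULL
);

INSERT INTO dbo.KeyTable (TableName, LastID) VALUES (N'MyTable', 0);

-- Increment and read the counter atomically in one statement:
DECLARE @NextID INT;
UPDATE dbo.KeyTable
SET @NextID = LastID = LastID + 1
WHERE TableName = N'MyTable';

Note that this hands out strictly increasing values: it avoids caching-induced jumps, but it does not reuse the IDs of deleted rows.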

The other option, assuming you're using SQL Server 2012+, would be to use a SEQUENCE object to get ID values for the column. However, you'd need to configure the sequence to not cache values. For example:

CREATE SEQUENCE dbo.MySequence AS INT START WITH -2147483648 INCREMENT BY 1 NO CACHE;
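You could then pull values ad hoc, or attach the sequence to the table as a default; the table and constraint names below are placeholders:

-- Fetch a single value directly:
SELECT NEXT VALUE FOR dbo.MySequence;

-- Or bind the sequence to the ID column as its default:
ALTER TABLE dbo.MyTable
    ADD CONSTRAINT DF_MyTable_ID
    DEFAULT (NEXT VALUE FOR dbo.MySequence) FOR ID;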

In answer to your boss's negative perception of "high" numbers, I would say: what difference does it make? Assuming you use an INT field with an IDENTITY, you could in fact start the IDENTITY at 2147483647 and "increment" the value by -1. This would make absolutely no difference to memory consumption, performance, or disk space used, since a 32-bit number is 4 bytes no matter whether it holds 0 or 2147483647. 0 in binary is 00000000000000000000000000000000 when stored in a 32-bit signed INT field, and 2147483647 is 01111111111111111111111111111111 - both numbers take precisely the same amount of space, both in memory and on disk, and both require precisely the same number of CPU operations to process. It is far more important to get your application code designed correctly than to obsess about the actual number stored in a key field.
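For illustration, such a descending IDENTITY would look like this (the table itself is hypothetical):

CREATE TABLE dbo.Records
(
    ID   INT IDENTITY(2147483647, -1) NOT NULL PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL
);
-- The first row gets ID 2147483647, the next 2147483646, and so on;
-- every value occupies exactly 4 bytes regardless of its magnitude.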

You asked about the pros and cons of either (a) using a larger-capacity ID column, such as a BIGINT, or (b) rolling your own solution to prevent ID gaps. To answer these concerns:

  1. BIGINT instead of INT as the data type for the column in question. Using a BIGINT requires double the storage, both on disk and in memory, for the column itself. If the column is the primary key of the table involved, each and every non-clustered index attached to the table will also store the BIGINT value, at twice the size of an INT, again both in memory and on disk. SQL Server stores data on disk in 8KB pages, where the number of "rows" per "page" depends on the "width" of each row. So, for instance, if you have a table with 10 columns, each one an INT, you'd be able to store approximately 160 rows per page. If those columns were instead BIGINT columns, you'd only be able to store 80 rows per page. For a table with a very large number of rows, this clearly means the I/O required to read and write the table doubles in this example for any given number of rows. Granted, this is a pretty extreme example - if you had a row consisting of a single INT or BIGINT column and a single NCHAR(4000) column, you'd be (simplistically) getting a single row per page whether you used an INT or a BIGINT. In this scenario, it would not make much appreciable difference.

  2. Rolling your own solution to prevent gaps in the ID column. You'd need to write your code in such a way that determining the "next" ID value to use does not conflict with other actions happening on the table. Something along the lines of SELECT TOP(1) [ID] FROM [schema].[table] ORDER BY [ID] DESC naively comes to mind. What if there are multiple actors attempting to write new rows to the table simultaneously? Two actors could easily obtain the same value, resulting in a write conflict. Getting around this problem requires serializing access to the table, reducing performance (a sketch follows this list). There have been many articles written about this problem; I'll leave it to the reader to perform a search on that topic.
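To illustrate point 2, here is a sketch of the serialization involved (table and column names are hypothetical). The UPDLOCK and HOLDLOCK hints make concurrent writers queue behind this transaction, which is exactly the performance cost described above. It uses the naive MAX + 1 approach; truly reusing deleted IDs would require a first-missing-value search instead, with the same locking problem:

BEGIN TRANSACTION;

DECLARE @NextID INT;

-- The hints serialize access: other writers block until this transaction commits.
SELECT @NextID = ISNULL(MAX([ID]), 0) + 1
FROM [dbo].[MyTable] WITH (UPDLOCK, HOLDLOCK);

INSERT INTO [dbo].[MyTable] ([ID], [Name])
VALUES (@NextID, N'example');

COMMIT TRANSACTION;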

The conclusion here is: you need to understand your requirements and properly estimate both the number of rows and the row width, along with the concurrency requirements of your application. As usual, It Depends™.