Going backwards just feels wrong to me.
With only two data centers you could also implement identity ranges. Unless you cycle through identity values at an alarming rate, there is no reason you can't have:
-- Data center 1
CREATE TABLE dbo.[Table]
(
ID INT IDENTITY(1,1) PRIMARY KEY
-- , ...
);
-- Data center 2
CREATE TABLE dbo.[Table]
(
ID INT IDENTITY(1000000000,1) PRIMARY KEY
-- , ...
);
This would allow you to generate 1 billion (well, 999,999,999) values in data center 1 before there is any danger of collision with data center 2. And you could add a CHECK constraint in data center 1 to prevent overlapping values, depending on how you prioritize errors vs. duplicates. You could also implement a recurring job that periodically checks how close you are to the lower bound of the other data center, if you're concerned that you'll really generate over a billion values in either data center in the app's lifetime (never mind yours).
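A minimal sketch of that CHECK constraint, assuming the seeds above (the constraint name is just an illustration; adjust the boundary to match your actual seeds):

```sql
-- Data center 1: reject any ID that strays into data center 2's range.
ALTER TABLE dbo.[Table]
    ADD CONSTRAINT CK_Table_ID_Range CHECK (ID < 1000000000);
```

With this in place, an insert that somehow lands in data center 2's range fails loudly instead of silently creating a future duplicate.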
If ~1 billion is not enough, there are other alternatives that give you more room to breathe:
-- Data center 1
ID INT IDENTITY(-2147483648,1)
-- Data center 2
ID INT IDENTITY(1,1)
This would give each data center over 2 billion non-overlapping values, with no danger of collision until data center 1 approached 0. And if that's still not enough:
-- Data center 1
ID BIGINT IDENTITY(-9223372036854775808,1)
-- Data center 2
ID BIGINT IDENTITY(1,1)
Or if you want to keep all values > 0, you could split the positive range roughly in half (you can be a little more precise than this if you're more pedantic than me).
-- Data center 1
ID BIGINT IDENTITY(1,1)
-- Data center 2
ID BIGINT IDENTITY(4611600000000000000,1)
I don't even know how to say that number, but it's ginormous. And in reality you'd have to work extremely hard on a very fast computer to use up all those values by the time your great great great great great grandchild graduates from college. And if you are on Enterprise Edition you can use data compression so that you don't pay the hit for all 8 bytes, at least in data center 1, until you exceed 2 billion there.
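As a sketch, row compression can be enabled with a rebuild (assuming the example table name above; Enterprise Edition only, per the note):

```sql
-- Row compression stores small BIGINT values in fewer than 8 bytes
-- on disk, so data center 1 pays little until values grow large.
ALTER TABLE dbo.[Table]
    REBUILD WITH (DATA_COMPRESSION = ROW);
```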
In a system I managed, I did this a slightly different way: we had multiple web servers running Express instances that were responsible for generating ID numbers that needed to be unique across the enterprise. So we just set up a sequence generator on each machine (they didn't actually need to store the value) using a BIGINT IDENTITY column. We had fewer than 9 servers, so they were all seeded like this:
-- WEBAPP0001
ID BIGINT IDENTITY(10000000000,1)
-- WEBAPP0002
ID BIGINT IDENTITY(20000000000,1)
-- WEBAPP0003
ID BIGINT IDENTITY(30000000000,1)
...
When the values were used and then merged in the central system, not only were we guaranteed not to have any duplicates, but it was also easy to recognize instantly which web server they came from (which was sometimes useful in debugging) without introducing any composite key requirements. And we had no concerns that any one web server would ever generate more than 10 billion values.
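The "which server did this come from" trick above is just integer division against the seed multiplier. A sketch, with a hypothetical central table name:

```sql
-- Seeds are 10000000000 * server number, so dividing any merged ID
-- by 10000000000 recovers the originating web server.
SELECT ID, ID / 10000000000 AS ServerNumber
FROM dbo.MergedValues;
```

For example, ID 30000000042 yields ServerNumber 3, i.e. WEBAPP0003.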
Many people would go with a GUID in this scenario, but I think there are several strong arguments against that approach.
Best Answer
This is an ongoing debate that many face, and experts will give good arguments for either of these. What your question boils down to is a surrogate vs. natural key (and option two seems to be a half-surrogate half-natural key).
As for your third option, I often see people do whatever they can to find a unique candidate for a primary key constraint. Take your example: name. That is a notoriously bad candidate for a key, because uniqueness is not guaranteed; think "John Smith". Not only that, but keying on it will bloat your foreign key relationships: every table that references that primary key must repeat the same data type and column width. Think a wide nvarchar as opposed to a narrow data type (i.e. int).
As for your second option, it sounds like you have the ability to create a unique value (by your description, one able to fit in an int column). Where would you plan on doing this? The logic would live either in the database or in the application. In my opinion, I like to keep entities like keys isolated to the data tier. This "id" would most likely be used for normalized data, and it should be an abstraction to the application (in my opinion). So my reasoning brings me to either a database-generated unique value (your option two) or an identity value (your option one). Those both have the same result (as far as I can see), so why go through the trouble and the development of generating your own when you can just use an IDENTITY column?

Again, others may and probably will have different and equally objective answers for going with another alternative, but my vote would be for the IDENTITY column for a unique, surrogate key.
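A minimal sketch of that surrogate-key approach (table and column names are purely illustrative):

```sql
-- The database generates the key; foreign keys stay a narrow 4 bytes,
-- and the natural attribute remains just an attribute, not the key.
CREATE TABLE dbo.Person
(
    PersonID INT IDENTITY(1,1) PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL  -- not unique: think "John Smith"
);
```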