Excuse the looseness of any definitions here, but I am trying to explore a simple concept.
A Primary Key uniquely identifies a row. There are possibly other columns in a table with unique values, so they can also uniquely identify a row (Candidate Keys), but the Primary Key is the one designated for the task.
Properties which make a Primary Key more useful include:
- guaranteed to be unique; some other unique column values may be duplicated as the table grows
- unlikely to change; although Foreign Key constraints can include cascading, it is better not to require this
- will not be recycled; a row which is deleted for some reason should not free the old PK value to be reused
For these reasons, I generally recommend that the Primary Key have no intrinsic value, so there would never be a reason to change or recycle a value. That is, it should be otherwise meaningless.
I have seen Primary Keys which include some sort of code, such as a client code based on the name. The obvious problems are (a) if the client name changes then the PK should change too, and (b) there is too much of a risk of conflict with clients with similar names.
A half exception is the use of an auto-incremented number, which has a minor meaning of a sequence number. However, it is still stable.
The question is, under what circumstances, if ever, is preferable to use a primary key with some other real meaning? That is, is there anything wrong with the advice that the PK should be arbitrary, and you can generally get by with a sequence number?
Best Answer
Given that the focus of this question is "preferable" and not "acceptable", and accepting that this is still a highly subjective topic, I will say that I cannot think of a situation where it is best for the system to have a truly natural key for a variety of reasons (most of which has been said before in other answers that Paul linked to in a comment on the question):
I emphasize "truly" because there are two situations where I prefer to not have a new surrogate key:
Bridge Tables
(or whatever you like to call tables used only, or mainly, to represent many-to-many relationships)
When modeling a bridge table (a table which doesn't exist in the logical model but is needed in the physical model), the PK there should be the existing Primary Key columns of the tables being related via this table. This allows for enforcing the proper uniqueness and non-NULL-ness of the values without needing a separate unique index / constraint. In the rare case of needing to foreign key to this relationship, it will:
I have worked on a system where these bridge tables had their own auto-incrementing, surrogate key PKs, and the single column surrogate key of the bridge table was referenced in other tables via FK:
It was a horrible, confusing mess that we wasted way too much time on for debugging, etc.
Sibling Tables
These are tables that are truly a single entity and thus have a 1-to-1 relationship. They are only split into two (or more, as is needed) tables for performance reasons. I have done this with tables having 1 million (or more) rows where it was either very wide, or if it was moderately wide and there were columns that were either not used very frequently or were string of over 50 bytes. Stuff like that. This keeps the core properties of the entity in a narrower table that fits more rows on each data page.
In these cases, the "sibling" table is at the exact same level as the initial table and should have the same
PK
. There is no use in giving it an auto-incrementing surrogate key since each row has what amounts to a natural key from the initial table.To be clear, I am speaking in terms of the physical model, not the conceptual model. I assumed that the focus of this question was the physical model since it is framed in the context of issues that do not exist conceptually: surrogate keys, issues dealing with usage of the primary key value, etc. With that in mind, I was not implying that natural keys should not be stored and used for identification. On the contrary, natural keys are great "alternate keys" and should have unique constraints / indexes placed on them. However, the idealism of the conceptual model does not always translate directly to the physical model. Data integrity (i.e. the stability and reliability of the data model) is a top, if not the top, priority of the physical model. And so practical considerations, such as using surrogate keys, must be made to ensure that this goal is met and not compromised. Meaning, if you have SSNs or SKUs, etc, then absolutely store them in a column that has a unique constraint on it, and have the system do a lookup on that value since auto-generated numbers shouldn't be used externally anyway. Users don't need to know the auto-generated ID number of a record: they should pass in a value that they do know (e.g. email address as a lookup for UserID / CustomerID, flight confirmation codes in combination with flight dates, etc) and the system should translate that into the auto-generated value that it uses from that point forward.
Yes, the problems noted at the beginning of this answer are still potential issues when using natural keys as alternate keys. However, the difference is that the problem is isolated to (usually) just one table. If someone made a mistake and created a unique index on "flight locator", it might be a while before they get a violation. But once they do, it is easy enough to drop that unique index and recreate it to include the flight dates as well. Or, if you change your email address (often used as the Login) on a system and get an error because it was in use by someone else years ago (legitimately), that can most likely be handled by support without any impact / risk to existing related records. In both cases the rest of the data model is untouched for the necessary changes.
Again, this is a pragmatic approach to:
I don't know how many systems used SSNs (Social Security Numbers in the U.S.) as PKs, but for any that did, some of them (many perhaps) may have avoided issues with them not being as unique as they should have been. But, none of those systems were able to avoid changes over the years regarding the need to handle them much more securely. Systems treating SSNs as an alternate key require very little development time to switch over to encrypting those values, and the system requires little downtime (or none) to make the changes at the data layer. Given that we all have backlogs of projects that we will likely never get to, businesses tend to prefer that these annoying yet inevitable changes will cost them 5 hours instead of 20 - 40 (don't forget that changes need to be tested, and hence the scope of the changes also directly impacts how much QA time is needed for the project).
To be explicit, there are some scenarios where it is "acceptable" to have natural keys, though I don't think I would go so far as to say "preferable".
INT
and if using a binary Collation (one ending in_BIN2
, or even_BIN
, but_BIN2
is preferred) then it should compare just as fast. Having relatively meaningful values for such codes can make support / debugging easier. However, you still get into the situation where over time, department names, etc might change and the codes might not be meaningful anymore.