When should a primary key be meaningful?

candidate-key database-design primary-key

Excuse the looseness of any definitions here, but I am trying to explore a simple concept.

A Primary Key uniquely identifies a row. There are possibly other columns in a table with unique values, so they can also uniquely identify a row (Candidate Keys), but the Primary Key is the one designated for the task.

Properties which make a Primary Key more useful include:

  • guaranteed to be unique; some other unique column values may be duplicated as the table grows
  • unlikely to change; although Foreign Key constraints can include cascading, it is better not to require this
  • will not be recycled; a row which is deleted for some reason should not free the old PK value to be reused

For these reasons, I generally recommend that the Primary Key have no intrinsic value, so there would never be a reason to change or recycle a value. That is, it should be otherwise meaningless.

I have seen Primary Keys which include some sort of code, such as a client code based on the name. The obvious problems are (a) if the client name changes then the PK should change too, and (b) there is too much of a risk of conflict with clients with similar names.

A half-exception is the use of an auto-incremented number, which carries the minor meaning of a sequence number. However, it is still stable.
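The recommended shape — an arbitrary, auto-incremented PK, with any meaningful values kept as unique non-key columns — can be sketched as follows. This is a minimal illustration using SQLite; the table and column names are assumed for the example:

```python
import sqlite3

# Assumed schema: a meaningless surrogate PK plus a UNIQUE constraint
# on the natural candidate key (the client code).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Client (
        ClientID   INTEGER PRIMARY KEY AUTOINCREMENT,  -- arbitrary, stable;
                                                       -- AUTOINCREMENT also prevents
                                                       -- SQLite from recycling values
        ClientCode TEXT NOT NULL UNIQUE,               -- natural candidate key
        ClientName TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO Client (ClientCode, ClientName) VALUES ('ACME', 'Acme Ltd')")

# The natural code can change without touching the PK that any FKs reference:
conn.execute("""
    UPDATE Client SET ClientCode = 'ACME2', ClientName = 'Acme Holdings'
    WHERE ClientID = 1
""")
row = conn.execute("SELECT ClientID, ClientCode FROM Client").fetchone()
print(row)  # (1, 'ACME2') -- PK unchanged, natural key updated
```

The surrogate stays fixed through the rename, so no cascading update is ever needed.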

The question is: under what circumstances, if ever, is it preferable to use a primary key with some other real meaning? That is, is there anything wrong with the advice that the PK should be arbitrary, and that you can generally get by with a sequence number?

Best Answer

under what circumstances, if ever, is it preferable to use a primary key with some other real meaning? (emphasis added)

Given that the focus of this question is "preferable" and not "acceptable", and accepting that this is still a highly subjective topic, I will say that I cannot think of a situation where it is best for the system to have a truly natural key, for a variety of reasons (most of which have been said before in other answers that Paul linked to in a comment on the question):

  1. What is thought to be unique is not always unique (e.g. Social Security Numbers / SSNs in the U.S.)
  2. Sometimes things change, either in value or uniqueness (we don't control the external world)
  3. Even when something should be "stable" in value and uniqueness (e.g. SKU, perhaps), can the incoming value be guaranteed to be correct? Humans often mistype values during data entry. There can also be bugs in export processes that cause values in a file imported by your system to be incorrect, and bugs in other systems that feed data into yours can corrupt the data itself, even when their export process works correctly.
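Reason 1 can be made concrete with a toy SQLite sketch (assumed schema, illustrative values): if an external identifier such as an SSN is the PK, a duplicate arriving from the outside world cannot even be stored.

```python
import sqlite3

# Assumed schema: the "natural" key is the PK, so uniqueness is an
# assumption about the outside world rather than something we control.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (SSN TEXT PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute("INSERT INTO Person VALUES ('123-45-6789', 'Alice')")  # illustrative value
try:
    # A second person shows up with the "same" SSN (duplicate issuance,
    # a typo during data entry, a buggy feed -- it doesn't matter which):
    conn.execute("INSERT INTO Person VALUES ('123-45-6789', 'Bob')")
except sqlite3.IntegrityError:
    print("second row rejected; with a surrogate PK both rows could coexist "
          "while the conflict is investigated")
```

With a surrogate PK and the SSN as a unique alternate key, the same conflict still surfaces, but it is isolated to one constraint on one table rather than to the identity of the row itself.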

I emphasize "truly" because there are two situations where I prefer to not have a new surrogate key:

Bridge Tables

(or whatever you like to call tables used only, or mainly, to represent many-to-many relationships)

Thing                          ThingXTag                Tag
------                         ---------                ---
ThingID INT AutoMagic PK --->  ThingID INT PK, FK      
Stuff   SomeType               TagID   INT PK, FK  <--- TagID   INT AutoMagic PK
                                                        TagName VARCHAR

When modeling a bridge table (a table which doesn't exist in the logical model but is needed in the physical model), the PK there should be the existing Primary Key columns of the tables being related via this table. This allows for enforcing the proper uniqueness and non-NULL-ness of the values without needing a separate unique index / constraint. In the rare case of needing to foreign key to this relationship, it will:

  • be meaningful in that each of the key columns will actually point back to the original source tables without needing yet another join, and
  • prevent someone or something from updating the key columns that form the relationship between the two primary tables without updating the value being linked to via a foreign key.

WackyTable                           ThingXTag
----------                           ---------
WackyTableID INT AutoMagic PK
ThingID      INT FK            --->  ThingID     INT PK, FK (to Thing.ThingID)
TagID        INT FK            --->  TagID       INT PK, FK (to Tag.TagID)
AttributeX   VARCHAR                 InsertDate  DATETIME
InsertDate   DATETIME
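The composite-PK bridge table can be sketched in SQLite (table names follow the diagrams above); the pair of FK columns is the PK, so the proper uniqueness needs no separate index:

```python
import sqlite3

# Bridge table whose PK is the pair of FKs, as in the ThingXTag diagram.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Thing (ThingID INTEGER PRIMARY KEY, Stuff TEXT);
    CREATE TABLE Tag   (TagID   INTEGER PRIMARY KEY, TagName TEXT);
    CREATE TABLE ThingXTag (
        ThingID INT NOT NULL REFERENCES Thing(ThingID),
        TagID   INT NOT NULL REFERENCES Tag(TagID),
        PRIMARY KEY (ThingID, TagID)   -- composite key: at most one row per pair
    );
    INSERT INTO Thing VALUES (1, 'widget');
    INSERT INTO Tag   VALUES (10, 'blue');
    INSERT INTO ThingXTag VALUES (1, 10);
""")
try:
    conn.execute("INSERT INTO ThingXTag VALUES (1, 10)")  # duplicate pair
except sqlite3.IntegrityError:
    print("duplicate (ThingID, TagID) rejected by the composite PK itself")
```

No extra unique constraint, and anything that FKs to this table carries both source keys with it.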

I have worked on a system where these bridge tables had their own auto-incrementing, surrogate key PKs, and the single column surrogate key of the bridge table was referenced in other tables via FK:

WackyTable                           ThingXTag
----------                           ---------
WackyTableID INT AutoMagic PK        
ThingXTagID  INT FK            --->  ThingXTagID INT AutoMagic PK
AttributeX   VARCHAR                 ThingID     INT FK (to Thing.ThingID)
InsertDate   DATETIME                TagID       INT FK (to Tag.TagID)
                                     InsertDate  DATETIME

It was a horrible, confusing mess that we wasted far too much time debugging.

Sibling Tables

These are tables that are truly a single entity and thus have a 1-to-1 relationship. They are only split into two (or more, as needed) tables for performance reasons. I have done this with tables of a million or more rows that were either very wide, or moderately wide with columns that were used infrequently or held strings of over 50 bytes. Stuff like that. This keeps the core properties of the entity in a narrower table that fits more rows on each data page.

In these cases, the "sibling" table is at the exact same level as the initial table and should have the same PK. There is no use in giving it an auto-incrementing surrogate key since each row has what amounts to a natural key from the initial table.

Product                            ProductProperty
-------                            ---------------
ProductID  INT AutoMagic PK  --->  ProductID        INT PK, FK (to Product.ProductID)
Name       VARCHAR                 ShortDescription VARCHAR
SKU        VARCHAR                 SomethingElse    SomeType
Quantity   INT                     UpdateDate       DATETIME
CreateDate DATETIME
UpdateDate DATETIME
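The sibling-table split above can be sketched in SQLite (names follow the diagram): the sibling reuses the initial table's key as both its PK and an FK, which also enforces the 1-to-1 shape.

```python
import sqlite3

# Sibling table sharing the PK of the initial table, as in the diagram.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Product (
        ProductID INTEGER PRIMARY KEY,
        Name      TEXT,
        SKU       TEXT UNIQUE,
        Quantity  INT
    );
    CREATE TABLE ProductProperty (
        ProductID        INT PRIMARY KEY REFERENCES Product(ProductID),
        ShortDescription TEXT
    );
    INSERT INTO Product VALUES (7, 'Gizmo', 'GZ-1', 3);
    INSERT INTO ProductProperty VALUES (7, 'A fine gizmo');
""")
try:
    # The shared PK guarantees at most one sibling row per product:
    conn.execute("INSERT INTO ProductProperty VALUES (7, 'a second row')")
except sqlite3.IntegrityError:
    print("1-to-1 enforced: only one ProductProperty row per Product")
```

An auto-incremented key on ProductProperty would add nothing: the row's identity already is the ProductID.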

To be clear, I am speaking in terms of the physical model, not the conceptual model. I assumed that the focus of this question was the physical model, since it is framed in the context of issues that do not exist conceptually: surrogate keys, issues dealing with usage of the primary key value, etc.

With that in mind, I was not implying that natural keys should not be stored and used for identification. On the contrary, natural keys make great "alternate keys" and should have unique constraints / indexes placed on them. However, the idealism of the conceptual model does not always translate directly to the physical model. Data integrity (i.e. the stability and reliability of the data model) is a top, if not the top, priority of the physical model, and so practical concessions, such as using surrogate keys, must be made to ensure that this goal is not compromised.

Meaning, if you have SSNs or SKUs, etc., then absolutely store them in a column that has a unique constraint on it, and have the system do a lookup on that value, since auto-generated numbers shouldn't be used externally anyway. Users don't need to know the auto-generated ID number of a record: they should pass in a value that they do know (e.g. an email address as a lookup for UserID / CustomerID, a flight confirmation code in combination with flight dates, etc.) and the system should translate that into the auto-generated value that it uses from that point forward.
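The lookup pattern can be sketched as follows (SQLite, with an assumed schema and a hypothetical helper function): the user supplies the value they know, and the system resolves it once into the surrogate key it uses internally from then on.

```python
import sqlite3

# Assumed schema: surrogate PK for internal use, natural alternate key
# (email) with a unique constraint for external lookups.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY AUTOINCREMENT,
        Email      TEXT NOT NULL UNIQUE    -- natural alternate key
    )
""")
conn.execute("INSERT INTO Customer (Email) VALUES ('pat@example.com')")

def customer_id_for(email):
    # Hypothetical helper: translate the external identifier to the
    # surrogate key, which is what every subsequent join/FK uses.
    row = conn.execute(
        "SELECT CustomerID FROM Customer WHERE Email = ?", (email,)
    ).fetchone()
    return row[0] if row else None

print(customer_id_for('pat@example.com'))  # 1
```

The auto-generated number never needs to leave the system; only the alternate key is exposed.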

Yes, the problems noted at the beginning of this answer are still potential issues when using natural keys as alternate keys. However, the difference is that the problem is (usually) isolated to just one table. If someone made a mistake and created a unique index on "flight locator" alone, it might be a while before they get a violation. But once they do, it is easy enough to drop that unique index and recreate it to include the flight dates as well. Or, if you change your email address (often used as the login) on a system and get an error because it was legitimately in use by someone else years ago, that can most likely be handled by support without any impact / risk to existing related records. In both cases the rest of the data model is untouched by the necessary changes.
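The "flight locator" fix is purely local, which is the whole point. A sketch in SQLite (assumed table and index names):

```python
import sqlite3

# Assumed schema: surrogate PK, with the mistaken single-column unique
# index on the natural alternate key.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Booking (
        BookingID     INTEGER PRIMARY KEY,
        FlightLocator TEXT NOT NULL,
        FlightDate    TEXT NOT NULL
    )
""")
conn.execute("CREATE UNIQUE INDEX UX_Booking_Locator ON Booking (FlightLocator)")

# Locators recycle across dates, so this index will eventually reject a
# legitimate row. The repair touches only the index -- no FK, no related
# table, no PK migration:
conn.execute("DROP INDEX UX_Booking_Locator")
conn.execute("""
    CREATE UNIQUE INDEX UX_Booking_Locator
        ON Booking (FlightLocator, FlightDate)
""")

conn.execute("INSERT INTO Booking VALUES (1, 'ABC123', '2024-01-05')")
conn.execute("INSERT INTO Booking VALUES (2, 'ABC123', '2024-06-09')")  # now allowed
print(conn.execute("SELECT COUNT(*) FROM Booking").fetchone()[0])  # 2
```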

Again, this is a pragmatic approach to:

  • minimizing potential data loss that can happen when migrating a primary key that has been referenced by one or more foreign keys (and minimizing the scope of the project also reduces the maintenance window :-). While not all PKs (or Unique Constraints/Indexes) have FKs, using a surrogate should certainly reduce the number of columns that have such dependencies
  • making the system as durable, resilient, and efficient as possible. The physical model is the "in practice" to the "in theory" of the conceptual model. And we already make quite a few adjustments and considerations given that the conceptual model doesn't care about implementation. Yet we make choices to utilize vendor-specific features, for performance (e.g. denormalization, lookup tables, "sibling" tables as noted above, etc), to create the "bridge" tables (as noted above), and so on.

I don't know how many systems have used SSNs (Social Security Numbers in the U.S.) as PKs, but for any that did, some (many, perhaps) may have avoided issues with them not being as unique as they should have been. But none of those systems were able to avoid the changes over the years in how securely SSNs must be handled. Systems treating SSNs as an alternate key require very little development time to switch to encrypting those values, and little or no downtime to make the changes at the data layer. Given that we all have backlogs of projects that we will likely never get to, businesses tend to prefer that these annoying yet inevitable changes cost them 5 hours instead of 20 - 40 (don't forget that changes need to be tested, so the scope of a change also directly impacts how much QA time the project needs).


To be explicit, there are some scenarios where it is "acceptable" to have natural keys, though I don't think I would go so far as to say "preferable".

  1. State / Region and Country codes: If you use codes maintained by the International Organization for Standardization (ISO) (e.g. "US" = United States, "FL" = Florida, etc.) then these are probably reliable enough to use, and I have used them as they are short (i.e. not bad for performance) and human-readable. These codes are also used often enough in a variety of other ways, even outside of computer systems, that people have a general familiarity with them, even codes that don't initially make as much sense to some folks (e.g. "DE" = Germany might not be intuitively obvious to those who are unaware that "Germany" == "Deutschland" in German).
  2. Internal lookup value codes: You can't control external sources, but (hopefully) you are in control of your own system. If your system has department codes, status codes, etc that are used internally, then it should be fine to come up with short codes for them (2 - 4 bytes). At 4 bytes it would be using the same amount of space as an INT and if using a binary Collation (one ending in _BIN2, or even _BIN, but _BIN2 is preferred) then it should compare just as fast. Having relatively meaningful values for such codes can make support / debugging easier. However, you still get into the situation where over time, department names, etc might change and the codes might not be meaningful anymore.
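An internal lookup table keyed by a short, human-readable code can be sketched as follows (SQLite for illustration; SQLite lacks SQL Server's `_BIN2` collations, but its default `BINARY` collation gives comparable byte-wise comparisons; the codes and names are assumed):

```python
import sqlite3

# Assumed lookup table: a short internal status code as the PK, since
# the codes are under our own system's control.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE OrderStatus (
        StatusCode CHAR(4) PRIMARY KEY,   -- e.g. 'PEND', 'SHIP' (2 - 4 bytes)
        StatusName TEXT NOT NULL
    );
    INSERT INTO OrderStatus VALUES ('PEND', 'Pending'), ('SHIP', 'Shipped');

    CREATE TABLE "Order" (
        OrderID    INTEGER PRIMARY KEY,
        StatusCode CHAR(4) NOT NULL REFERENCES OrderStatus(StatusCode)
    );
    INSERT INTO "Order" VALUES (1, 'PEND');
""")

# The referencing row reads meaningfully without a join, which is the
# support / debugging benefit described above:
print(conn.execute('SELECT OrderID, StatusCode FROM "Order"').fetchone())  # (1, 'PEND')
```

The trade-off from the text still applies: if "Pending" is ever renamed, the 'PEND' code stays behind as a mildly misleading fossil.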