The advantage of using numeric primary keys in a relational database

database-design ddl primary-key rdbms

Many tables in a relational database system (in my admittedly limited experience) have this sort of structure (Oracle syntax as that's what I'm used to):

create table widgets (
    widget_id number primary key,
    widget_name varchar2(20) not null,
    widget_colour varchar2(10)
);
create unique index ix01_widgets
    on widgets (widget_name);

So there are two indices: one for the primary key, widget_id, and one for the widget_name.

Can anyone explain the advantage of having a numeric primary key over just using widget_name as the primary key? The latter has the advantage of making the contents of child tables more readable, e.g.:

create table widget_characteristics (
    widget_name varchar2(20) not null,
    characteristic_name varchar2(20) not null,
    characteristic_value varchar2(100),
    constraint pk_widget_characteristics
        primary key (widget_name, characteristic_name),
    constraint fk01_widget_characteristics
        foreign key (widget_name)
        references widgets (widget_name)
);

Otherwise (if using a generated numeric primary key) I would need to create views to see the denormalised data.
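For comparison, here is a rough sketch of what the numeric-key version would look like, assuming the widgets table defined at the top. The view name v_widget_characteristics is just a placeholder I made up for illustration.

```sql
-- Child table keyed on the generated widget_id instead of widget_name
create table widget_characteristics (
    widget_id number not null,
    characteristic_name varchar2(20) not null,
    characteristic_value varchar2(100),
    constraint pk_widget_characteristics
        primary key (widget_id, characteristic_name),
    constraint fk01_widget_characteristics
        foreign key (widget_id)
        references widgets (widget_id)
);

-- Hypothetical view that joins the name back in for readability
create or replace view v_widget_characteristics as
select w.widget_name,
       c.characteristic_name,
       c.characteristic_value
from widgets w
join widget_characteristics c
    on c.widget_id = w.widget_id;
```

Querying the view then reads much like the widget_name-keyed table would, at the cost of a join on every read.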

Is it something about the "indexability" of numbers versus strings? I can understand the issue of index size if the identifier (i.e. widget_name) is long, but is that still an issue if it's short? For tables with multi-column primary keys I can understand the logic of having a generated numeric PK, but for "simple" tables I'm not clear.

Given the ubiquity of numeric primary keys I assume there is a fundamental reason for this: I'm just not sure what it is!

Best Answer

Size is one consideration, certainly. Consider not only the size of the index in the widgets table, but also that primary keys show up in other tables as foreign keys, so every extra byte in the key is repeated in each child row and in each index on it. In some systems, even short strings take up more space than integers (e.g., in SQL Server a VARCHAR value takes one byte per character plus 2 bytes of overhead, so at only two characters it is already as large as a 4-byte INT).
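If you want to check this on your own data rather than take the arithmetic on faith, a quick sketch in Oracle is to compare the stored size of each candidate key with VSIZE, which returns the number of bytes in a value's internal representation. This assumes the widgets table from the question. Note that Oracle's NUMBER is variable-length, so the figures will not match SQL Server's fixed 4-byte INT.

```sql
-- Bytes used by each candidate key value, row by row
select widget_name,
       vsize(widget_id)   as id_bytes,
       vsize(widget_name) as name_bytes
from widgets;

-- Average over the whole table, to estimate the per-row cost
-- of carrying each key into child tables as a foreign key
select avg(vsize(widget_id))   as avg_id_bytes,
       avg(vsize(widget_name)) as avg_name_bytes
from widgets;
```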

Almost all numeric primary keys I have seen are surrogate keys, implemented via serial, sequence, auto_increment, IDENTITY or whatever the database engine's native method is for generating values itself. I suspect this is a leading reason for the ubiquity of the numeric primary key. One advantage of surrogate keys is that they have no business meaning. Since business meaning can change over time, using a key without business meaning helps ensure that the primary key is static.
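As a minimal sketch in Oracle, assuming the widgets table from the question, a surrogate key might be fed from a sequence like this. The sequence name seq_widgets and the sample values are made up for illustration.

```sql
-- Hypothetical sequence supplying widget_id values
create sequence seq_widgets start with 1 increment by 1;

insert into widgets (widget_id, widget_name, widget_colour)
values (seq_widgets.nextval, 'Sprocket', 'red');

-- Renaming the widget later touches exactly one row in one table;
-- child rows that reference widget_id are unaffected.
update widgets
set widget_name = 'Sprocket Mk II'
where widget_name = 'Sprocket';
```

The rename is the point of the example: the business-facing attribute changes while the key that other tables depend on stays put.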

If your widgets have some sort of industry-standard identifier (like the auto industry's VIN, publishing's ISBN, UPCs and so forth), that's probably the best choice for your primary key. My concern with using widget_name is whether that attribute is truly immutable. Will it ever change? How do you know it will never change? Did Sales tell you that? :)
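To make the cost of a changing natural key concrete, here is a sketch of a rename when widget_name is the primary key. It assumes a variant of widgets with only widget_name (as primary key) and widget_colour, plus the widget_characteristics table from the question. Oracle has no ON UPDATE CASCADE for foreign keys, so with an ordinary (non-deferrable) constraint the parent key cannot simply be updated while child rows exist. One workaround, among others, is to copy, repoint, and delete:

```sql
-- 1. Create the parent row under its new name
insert into widgets (widget_name, widget_colour)
select 'Sprocket Mk II', widget_colour
from widgets
where widget_name = 'Sprocket';

-- 2. Repoint every child row at the new parent
update widget_characteristics
set widget_name = 'Sprocket Mk II'
where widget_name = 'Sprocket';

-- 3. Remove the old parent row, now that nothing references it
delete from widgets
where widget_name = 'Sprocket';

commit;
```

Step 2 has to be repeated for every table that carries widget_name as a foreign key, which is exactly the work a stable key avoids.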

The whole surrogate vs. natural key issue is nearly a religious debate, and is sort of tangential to your question. I would say if you have a natural key that is static, minimal, and unique, use it. Otherwise, consider a surrogate key (which is likely going to be numeric).