Performance of multi field primary key or ‘contrived’ ‘semi artificial’ key

primary-key

This is not a question regarding the benefits or otherwise of using an artificial auto increment key in any given table against using a multi-field 'primary key'. That discussion (or argument) can easily be found and decisions made by whoever would like to search for them.

This question is more about performance of the keys (or lack thereof)

I work as a DB manager, and when I create my tables I attempt to use a 'natural' key for the table. Often times this comes out as being a set of 2,3, or sometimes 4 fields that act as the primary key for the given table. More often than not these fields are Varchar in nature, but short (10 or 15 characters in length at most). Personally I try to keep them shorter!

My question is this.

Imagine I have a table that holds demographic data.
The only way I can ensure that I have uniqueness in each row is to use the fields for
FirstName FamilyName DateOfBirth PlaceOfBirth

(You may wonder why I include 'place of birth', I am aware of another individual (who used to live nearby – same phone number, different dialing code) with whom I shared all these details (I assume that the PlaceOfBirth was different, but I guess I could have used MothersMaidenName 😉 )

so now I have an interesting problem.

I could use a much shorter field that is created from concatenating the info in the 4 main fields
example: DateOfBirth first 2 characters of FirstName first 2 characters of FamilyName first 2 character of PlaceOfBirth

My question is this.

At what point would the concatenation of field provide a performance improvement over using the fields directly, ie how many columns.

I know from searching that most DBMS's have a 'thoretical maximum size limit' dependent on the B-Tree that is created. I am assuming that I don't hit this limit in terms of the length/size of the primary key.

My reason for considering using this type of 'contrived' key is:
The information in the concatenated column is most likely sufficient to be able to identify the record without the need to extract all the primary key fields (would this be better for performance or no different compared to using all the 4 primary key fields?)

This is obviously a fairly 'theoretical' question, but I have considered doing this concatenation on a table that end up with 4 varchar fields, and it was obvious that uniqueness would be described by using just a shortened version. Obviously there is an effort to create this field in the first place, but in others opions would this effort be worthwhile, and at what point would it become more interesting.

I have searched for this but I have never found this question asked directly, it allays comes out as a 'natural' or 'artificial' primary key discussion.

Of course if this feels like a 'natural' or 'artificial' key discussion feel free to say so. My feeling is that this 'contrived' key would proffer the advantages of a both. Has anyone used this idea in a real world solution?

Thanks in advance for your thoughts.

David

Edit.
I just found this thread

https://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls

It seems to cover similar ground, I must admit I hadn't thought of 'hashing' my columns together (mainly because they are short by nature), but I do like the idea. I guess you could do that and hash the whole row !

Edit2.

I've returned to this question just to see if there has been any change to the answers, or extra comments. I've decided to accept a response, but would like to note that I have found all the responses helpfull in the terms of the discussion.

Best Answer

I'll answer obliquely...

The natural key is always the natural key and should be enforced with a unique constraint or index. This is the "primary key" that flows from your modelling phase.

The choice of an auto-number/identity surrogate key matters at implementation phase because there are good and bad choices for your clustered index (example: SQL Server, Sybase, MySQL InnoDB, Oracle IOT).

That is, primary key is orthogonal to your clustered index: don't confuse the two issues

I'd suggest using a contrived key adds no value over using an auto-number/identity column in this respect. You lose data from the natural key, probably won't be unique, is just as opaque.

FWIW, I use surrogate keys and composite keys when I need too:

  • Some natural keys are useful in their own right: ISO currency and country codes
  • A table with no secondary (non-clustered) indexes and no child table doesn't benefit from a surrogate key
  • If you have parent-child-grandchild, then I usually need to join parent-grandchild: with composite keys I can do so directly. Simpler JOINs, simpler indexes

Note: this assumes that every table requires a clustered index

Related on dba.se: SQL Server Primary key / clustered index design decision