Advanced design question related to compound primary keys

database-design

As a database design expert, I had a discussion with an academic person about database design and I need some arguments on why some design is bad or not normalized or bad practice. As we cannot discuss here or have opinions, I'm just asking why something is wrong. A possible answer would be to show a clear violation of normalization rules or a pointer to a commonly accepted design standard. Or even an answer that this cannot be decided based on clear rules.

So, let's assume we have three tables. First is a Client table with values C1 and C2. Then we have a Users table with users A, B, C, D, and users A and B belong to Client C1, while users C and D belong to Client C2, so the Users table has a foreign key to the Client table. Then we have a Products table with values 1, 2, 3. Now we need to connect Products with Users (n:n), so we need a connection table.

If we list all possible values for the planned connection table, we get (Product, User, Client the user belongs to):

1 A C1
2 A C1
3 A C1
1 B C1 *
2 B C1 *
3 B C1 *
1 C C2
2 C C2
3 C C2
1 D C2 *
2 D C2 *
3 D C2 *

In my experience, where I lack of proof, every table should have a single primary key, not related to any content. The (wrong) design I was confronted with, made the connection table to not only have the Product-ID and User-ID, but additionally store the redundant field Client-ID (which could be looked up in Users table). Argument was that this is the only way to ensure that products are not connected twice to users of the same client (the values with the star in the above list would not be allowed). This can be done by adding a unique key on the connection table fields Product-ID and Client-ID, but it requires to have the Client-ID be present in that table, although it is already defined in the Users table and could be looked up there.

So why is this wrong? Only because of the redundant data (normalization rule)? How can we otherwise ensure then that no duplicates are stored (only by application logic or triggers)?

In a different scenario (but same underlying question), let's assume we have a grid with data, like an Excel sheet. To store this data we would have a table with the data per cell and foreign keys to the columns/rows table. The columns and rows are not different, so both columns and rows are stored in the same lookup table. This ColumnRow table had in the bad design a compound primary key (two fields), so the foreign key in the grid table would need both fields for each foreign key, requiring a total of 4 fields to reference the row and column. Now there's a restriction that the first part of the two foreign keys must be identical for both row and column, so only the "common" first part was stored ensuring this. If I design this with normal (single, not compound) foreign keys, we cannot ensure that both row and column belong to the same higher-level thing. So this makes it even more difficult for me to argue for single primary keys.

So in general, is there no generic consent to have single (non-compound) and data-unrelated primary keys? Or is this all just opinion-based? Is there expert (reference) literature about this subject that I could point to?

Best Answer

This is opinion based, or at least quite subjective in the real world with the best solution varying depending on a number of factors the relative importance of each different people with disagree on, unfortunately.

The question is whether you should ever create a surrogate key when there is an apparently perfectly suitable candidate key (compound or not) in the existing data. http://en.wikipedia.org/wiki/Surrogate_key has a fuller run down on the advantages and disadvantages, but I tend to err on the side of surrogate keys (integers or UUIDs depending on circumstance) in such circumstances for the following reasons:

  • Storage efficiency: The PK values are going to be found in all tables that refer to it in an FK, and in any indexes defined over those FKs: a surrogate key is likely to be smaller (at 4 bytes for a 32-bit integer, 8 (64 bits) if you need a larger range than that, 16 for a UUID) than any compound candidate key.
  • Data Consistency and Changes: a primary key should be unique, immutable, and always known, but many of the things people often assume fit those rules actually don't: people can change names, due to mistakes elsewhere there are people with duplicate identification numbers (such as National Insurance Numbers in the UK (note: you should not use NINos as a key for other reasons too)) so at some point the affected people might have that identifying number changed, and so on. ON UPDATE CASCADE or equivalent can help a lot in the changes scenario but that is not supported by all DMBSs (and can be very expensive depending on your data size).
  • Later Design Changes: No matter how immutable your client says the data and the rules around it are, they will at some point need to change that. Changes to the assumptions upon which you have chosen a natural candidate key to be the primary key for a table can have a much more significant effect on your design than if you had used a surrogate key.
  • Consistency in design/coding style, which can help understanding and maintainability later.

This is not hard and fast: there are circumstances where I'll prefer a natural candidate key over a surrogate key.

One thing you must remember when talking to academics is that they are usually talking from the point of view of having a perfect implementation of the theory they are talking about as a starting point, and there is absolutely no relational database implementations out there that can be considered to be that so there are practical considerations that an academic discussion may completely ignore due to this sort of "perfect world" assumption (in theory there is no difference between theory and practise, but in practise there usually is). I don't intend this to be disparaging to the academics here: I'm just meaning that we need to be very careful to define the context of such discussion to avoid talking at crossed purposes with us both being completely correct within the confines of the context we think we are talking about but quite wrong within the context the other party is framing the discussion in.