Advanced design question related to compound primary keys

database-design

As a database design expert, I had a discussion with an academic person about database design and I need some arguments on why some design is bad or not normalized or bad practice. As we cannot discuss here or have opinions, I'm just asking why something is wrong. A possible answer would be to show a clear violation of normalization rules or a pointer to a commonly accepted design standard. Or even an answer that this cannot be decided based on clear rules.

So, let's assume we have three tables. First is a Client table with values C1 and C2. Then we have a Users table with users A, B, C, D, and users A and B belong to Client C1, while users C and D belong to Client C2, so the Users table has a foreign key to the Client table. Then we have a Products table with values 1, 2, 3. Now we need to connect Products with Users (n:n), so we need a connection table.

If we list all possible values for the planned connection table, we get (Product, User, Client the user belongs to):

1 A C1
2 A C1
3 A C1
1 B C1 *
2 B C1 *
3 B C1 *
1 C C2
2 C C2
3 C C2
1 D C2 *
2 D C2 *
3 D C2 *

In my experience, where I lack of proof, every table should have a single primary key, not related to any content. The (wrong) design I was confronted with, made the connection table to not only have the Product-ID and User-ID, but additionally store the redundant field Client-ID (which could be looked up in Users table). Argument was that this is the only way to ensure that products are not connected twice to users of the same client (the values with the star in the above list would not be allowed). This can be done by adding a unique key on the connection table fields Product-ID and Client-ID, but it requires to have the Client-ID be present in that table, although it is already defined in the Users table and could be looked up there.

So why is this wrong? Only because of the redundant data (normalization rule)? How can we otherwise ensure then that no duplicates are stored (only by application logic or triggers)?

In a different scenario (but same underlying question), let's assume we have a grid with data, like an Excel sheet. To store this data we would have a table with the data per cell and foreign keys to the columns/rows table. The columns and rows are not different, so both columns and rows are stored in the same lookup table. This ColumnRow table had in the bad design a compound primary key (two fields), so the foreign key in the grid table would need both fields for each foreign key, requiring a total of 4 fields to reference the row and column. Now there's a restriction that the first part of the two foreign keys must be identical for both row and column, so only the "common" first part was stored ensuring this. If I design this with normal (single, not compound) foreign keys, we cannot ensure that both row and column belong to the same higher-level thing. So this makes it even more difficult for me to argue for single primary keys.

So in general, is there no generic consent to have single (non-compound) and data-unrelated primary keys? Or is this all just opinion-based? Is there expert (reference) literature about this subject that I could point to?

Best Answer

This is opinion based, or at least quite subjective in the real world with the best solution varying depending on a number of factors the relative importance of each different people with disagree on, unfortunately.

The question is whether you should ever create a surrogate key when there is an apparently perfectly suitable candidate key (compound or not) in the existing data. http://en.wikipedia.org/wiki/Surrogate_key has a fuller run down on the advantages and disadvantages, but I tend to err on the side of surrogate keys (integers or UUIDs depending on circumstance) in such circumstances for the following reasons:

Storage efficiency: The PK values are going to be found in all tables that refer to it in an FK, and in any indexes defined over those FKs: a surrogate key is likely to be smaller (at 4 bytes for a 32-bit integer, 8 (64 bits) if you need a larger range than that, 16 for a UUID) than any compound candidate key.
Data Consistency and Changes: a primary key should be unique, immutable, and always known, but many of the things people often assume fit those rules actually don't: people can change names, due to mistakes elsewhere there are people with duplicate identification numbers (such as National Insurance Numbers in the UK (note: you should not use NINos as a key for other reasons too)) so at some point the affected people might have that identifying number changed, and so on. ON UPDATE CASCADE or equivalent can help a lot in the changes scenario but that is not supported by all DMBSs (and can be very expensive depending on your data size).
Later Design Changes: No matter how immutable your client says the data and the rules around it are, they will at some point need to change that. Changes to the assumptions upon which you have chosen a natural candidate key to be the primary key for a table can have a much more significant effect on your design than if you had used a surrogate key.
Consistency in design/coding style, which can help understanding and maintainability later.

This is not hard and fast: there are circumstances where I'll prefer a natural candidate key over a surrogate key.

One thing you must remember when talking to academics is that they are usually talking from the point of view of having a perfect implementation of the theory they are talking about as a starting point, and there is absolutely no relational database implementations out there that can be considered to be that so there are practical considerations that an academic discussion may completely ignore due to this sort of "perfect world" assumption (in theory there is no difference between theory and practise, but in practise there usually is). I don't intend this to be disparaging to the academics here: I'm just meaning that we need to be very careful to define the context of such discussion to avoid talking at crossed purposes with us both being completely correct within the confines of the context we think we are talking about but quite wrong within the context the other party is framing the discussion in.

Related Solutions

In SQL, is it composite or compound keys

Both composite key and compound key describe a candidate key with more than one attribute. According to the Relational Database Dictionary (C.J.Date) they mean the same thing.

In ER modelling the term "compound key" also has a more specific meaning. It means a key whose constituent attributes are references to keys in other entities - i.e. a compound key forms an identifying relationship. For most purposes, this isn't an especially useful or important concept so the terms composite/compound are often treated as interchangeable. It's probably best to stick to "composite key" unless you are referring specifically to the ER modelling concept of a compound key.

Sql-server – SQL Server Primary key / clustered index design decision

You are correct to separate "clustered index" from "primary key":

A clustered index is the organisation of data on disk is better if
- narrow
- numeric
- increasing (strictly monotonic)
The primary key identifies a row

Note: GUIDs make poor clustering keys

In this case, with the surrogate column, the table has 2 candidate keys:

ProductHistoryID
ProductNo + CreatedDateTime

Assumed convention states that the ProductHistoryID becomes the PK, but you can leave the PK on (ProductNo, CreatedDateTime): it will just be non-clustered. Which leads to indexes:

clustered index should be on ProductHistoryID
unique non-clustered index on (ProductNo, CreatedDateTime)

Example

CREATE TABLE Product (
    ProductHistoryID int NOT NULL IDENTITY (1,1) NOT NULL,
    ProductNo ...
    CreatedDateTime ...

then you a choice of

    CONSTRAINT PK_Product PRIMARY KEY CLUSTERED (ProductHistoryID)
    CONSTRAINT UQ_Product UNIQUE NONCLUSTERED (ProductHistoryID)

    CONSTRAINT PK_Product PRIMARY KEY NONCLUSTERED (ProductNo, CreatedDateTime)
    CONSTRAINT PK_Product UNIQUE CLUSTERED (ProductHistoryID)

Also, the pattern you have is a "type 2 Slowly Changing Dimension"

Best Answer

Related Solutions

In SQL, is it composite or compound keys

Sql-server – SQL Server Primary key / clustered index design decision

Related Question